1.   INTRODUCTION

Paged virtual memory systems (PVMS) were introduced by Kilburn
et al. [KILB62] in the Ferranti Atlas computer. The objective of a PVMS is
to relieve the programmer of the burden of storage management. At first,
the cost in overhead and performance degradation was thought to be
high [FINE66, KUEH68]. Nevertheless, others argued in favor of PVMS
[DENN70, SAYR69], and PVMS are now well accepted and have proved useful
in practice [DENN70]. Their effectiveness and overhead, however, still
leave much to be desired [MASU72, ORGA72]. This is especially true given
changing technology with very large memories and problems involving
large data bases and large arrays.

In  this  thesis,  we  are  concerned  with  the  performance  of  programs 
running  under  a  PVMS  and  with  techniques  and  algorithms  that  include 
prepaging  as  a  more  viable  solution  for  the  computer  systems  of  the 
future. 

In  Chapter  2,  we  introduce  the  terminology  used  and  survey  the 
relevant  literature.  We  identify  variables  of  a  PVMS  which  affect  the 
performance  of  its  programs.  We  discuss  the  question  of  what  performance 
measures  to  use.  We  then  study  the  effect  of  each  of  the  variables  of 
PVMS  on  the  performance,  from  the  available  literature.  We  focus  our 
attention  on  two  important  variables  of  a  PVMS,  namely,  the  paging 
algorithm  used  by  the  system  and  the  locality  of  the  programs.  We  note 
that the most popular paging algorithms are demand paging algorithms,
chiefly because of their simplicity in implementation and because little
is known about prepaging. The property of locality has been observed in
practical programs and is, perhaps, the chief reason for the practicality
of PVMS [DENN70]. For special applications, methods have been given
for writing "more local" programs. Compiler-implemented locality improvement
methods have been restricted to code and do not cover data. Further, these
methods perform only a physical reorganization of code and not a logical
reorganization.

In Chapter 3, we present several new paging algorithms, some of
them being prepaging algorithms. We show why prepaging is useful. We
present a new algorithm, DPMIN, which is a demand-prepaging algorithm, and
prove that it is an optimal demand-prepaging algorithm. However, DPMIN
cannot be implemented in practice since it requires that the program's
reference string be known in advance. We also present several practical
prepaging algorithms. We then present a variable-memory prepaging algorithm
called PWS, which is based on Denning's WS algorithm, and prove that it
incurs zero page faults. The PWS algorithm is also impractical in the same
sense as DPMIN and is only useful for theoretical purposes. We also
study the question of performance measurement while using these prepaging
algorithms; in particular, we study whether or not the proposed paging
algorithms are stack algorithms [MATT70].

In Chapter 4, we show how to improve the locality of matrix
algorithms.  This  includes  a  logical  reorganization  of  the  program  and 
its  application  to  several  common  matrix  algorithms.   We  also  measure 
and  compare  the  average  working  set  size  of  programs  and  show  that  this 
is  a  reasonable  way  to  measure  locality. 


In Chapter 5, we introduce prepaging in matrix algorithms. By
using the prepaging algorithms proposed in Chapter 3, we show how to
improve the performance of the matrix algorithms.

In  Chapter  6,  we  discuss  the  automation  of  the  performance-improvement 
techniques of Chapters 4 and 5. Prepaging applied to matrix algorithms can
be  implemented  in  a  compiler  but  not  in  an  operating  system.   Eventually 
one  would  hope  that  these  techniques  would  find  their  way  into  some  of 
the  compilers  of  the  future  and  thereby  provide  efficient  prepaging  that 
cannot  be  obtained  in  other  ways. 


2.   GENERAL  PAGING 

2.0  Introduction 

In  a  paged  virtual  memory  system  (PVMS),  a  program's  address  space 
is  divided  into  fixed-size  blocks  of  addresses  called  pages  and  the  main 
memory  (MM)  is  divided  into  matching  size  blocks  of  locations  called 
page  frames. 

We will let $N = \{1, 2, \ldots, n\}$ denote a program's address space
and $M = \{1, 2, \ldots, c\}$ denote the set of MM page frames allotted to the
program, where generally $1 \le c \le n$. When a program is executing, it makes
a sequence of references to its address space $N$. We will denote such a
sequence by $\omega = r_1, r_2, \ldots, r_t, \ldots, r_T$, where $r_t \in N$, $t \ge 1$. We will
let $|\omega|$ denote the number of references in $\omega$, $N^0$ denote the set containing
the null string $\Lambda$, $N^T = \{\omega = r_1, r_2, \ldots, r_t, \ldots \mid |\omega| = T\}$, $N^*$ denote
the set of all finite strings on $N$, and $N^+ = N^* - \{\Lambda\}$.

Now since the CPU (central processing unit) can only refer to
the MM, we must interpose a mapping mechanism between the CPU and the MM.
The address map $f$, then, is a function $f : N \to M$. Since it is possible
that $c < n$, $f$ may be a partial function as follows:

$$f(i) = \begin{cases} j & \text{if page } i \text{ resides in page frame } j \\ \text{undefined} & \text{otherwise} \end{cases}$$


The  function  f  is  also  dynamic,  since  in  general,  a  program  may  refer 
to  any  of  the  pages  of  N.   If  the  page  size  is  p,  a  valid  virtual  address 
is an integer $\alpha$, $0 \le \alpha < np$. Similarly, a valid MM address is an
integer $\beta$, $0 \le \beta < cp$. When presented with the virtual address $\alpha$, the
address translation mechanism (MAP) obtains $(i, \xi)$ such that $\alpha = (i-1)p + \xi$,
where $0 \le \xi < p$, and then generates the memory address $\beta = [f(i)-1]p + \xi$ if
$f(i)$ is defined and a 'page fault' if $f(i)$ is undefined. A page fault
interrupts  the  execution  of  the  program  and  the  program  makes  a  transition 
from  running  state  to  the  page-wait  state.   If  we  assume  that  the  whole 
of  the  program's  address  space  resides  on  a  secondary  memory  device  such 
as  a  disc,  a  drum  or  a  bulk  core  store,  then  upon  the  occurrence  of  a 
page  fault,  the  memory  manager  has  to  bring  the  required  page  from 
secondary  memory  to  MM.   The  program  waits  for  the  required  page  for 
the  duration  of  one  page  fetch  time,  assumed  to  be,  on  the  average,  T 
time  units.   If  the  MM  was  originally  full,  the  above  process  will  also 
force  us  to  choose  a  page,  already  resident  in  MM,  to  be  replaced.   The 
process  of  page  replacement  may  also  involve  a  page  push  from  MM  to  the 
secondary  memory  if  it  was  written  into  during  its  last  stay  in  MM. 
After  the  required  page  has  been  brought  into  MM,  the  interrupted 
program  is  in  the  ready  state,  ready  to  execute  on  a  CPU  if  one  is 
available.   During  the  time  of  the  above  page  transfer,  the  CPU  will 
remain  idle  unless  we  resort  to  multiprogramming.   In  multiprogramming, 
we  try  to  mask  a  long  page  transfer  time  by  overlapping  I/O  and  execution. 
After a page fault interrupt, the CPU is allotted to another ready
program if one is available. If the system does not allow
multiprogramming, then it is a monoprogrammed system.
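
The address translation step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `translate`, `PageFault` and `frame_of` are hypothetical, the partial map f is represented by a dictionary, and pages and page frames are numbered from 1 as in the text.

```python
class PageFault(Exception):
    """Raised when the referenced page has no frame in main memory."""


def translate(virtual_address, page_size, frame_of):
    """Map a virtual address to a main-memory address.

    frame_of plays the role of the partial map f: it maps a page
    number i (1-based) to the page frame number j (1-based) holding it.
    Pages that are absent from frame_of are not resident in MM.
    """
    # Split the address into a 0-based page index and an offset in the page.
    page_index, offset = divmod(virtual_address, page_size)
    page = page_index + 1  # pages are numbered from 1 in the text
    if page not in frame_of:
        raise PageFault(page)  # f(i) is undefined
    return (frame_of[page] - 1) * page_size + offset


# Page 3 is resident in frame 1; address 2053 is word 5 of page 3.
print(translate(2053, 1024, {3: 1}))
```

In a real system this mapping is carried out by hardware on every reference; the exception here merely stands in for the page fault interrupt.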


It has been observed in practice that a page fault incurs a
significant amount of overhead. As an example, Masuda [MASU72] reports
that in his system, more than 50% of the running time of the operating system
is spent servicing page fault interrupts and 30% of the time is pure
idle time during which the cpu has no task to execute and is waiting
for the I/O termination interrupt. The situation is not much
different in other PVMS [ORGA72]. Thus it is very important
to study the performance of a PVMS and to find methods to improve it.
Improving the performance of programs running under a PVMS will
improve the performance of the PVMS itself. We will investigate the
performance of programs running under a PVMS.

2.1  Performance  Measures 

To  study  the  performance,  we  need  a  set  of  measures.   Furthermore, 
performance  as  seen  by  the  user  and  as  seen  by  the  system  are  two  different 
ways  to  study  this  topic.   From  the  point  of  view  of  system  efficiency, 
the number of page faults $\pi$, the number of page pulls C, and the space-time
product ST of MM used are three important performance measures that we
will  consider.   From  the  user  point  of  view,  the  cost  of  running  the 
program  and  the  turnaround  time  are  important  performance  measures.   In 
a  monoprogrammed  system,  user  oriented  measures  are  directly  related  to 
the  number  of  page  faults,  thus  in  such  a  system  it  will  be  sufficient 
to  study  the  page  fault  measure.   In  a  multiprogrammed  system,  user 
oriented  measures  cannot  be  so  easily  related  to  the  system  oriented 
measures. In most systems page faults or page pulls do not appear
explicitly in the cost charged to the user. However, a few milliseconds
of cpu time per page fault is generally required. Since the cost is
generally  determined  by  the  cpu  time  used  and  memory  usage,  we  see  that 
the  cost  is  affected  by  page  faults  and  the  ST  product.   In  a  similar 
way,  the  turnaround  time  is  affected,  at  least  indirectly,  by  the  number 
of  page  faults  incurred.   Because  of  these  reasons,  we  will  consider 
only  the  system  oriented  performance  measures. 

A  page  fault  implies  the  following  sequence  of  operations: 
interruption  of  the  program  and  a  switch  to  the  page  fault  handler, 
search  for  the  required  page  on  the  secondary  memory  device,  search  for 
a  page  to  be  replaced,  issuing  of  output  instructions  to  push  the  page 
to  be  replaced  (if  required),  issuing  of  input  instructions  to  fetch  the 
required  page,  searching  the  ready  queue  and  finally  a  process  switch  to 
a new program from the ready queue. All these operations take a considerable
amount of time, in fact, a few milliseconds per page fault [MASU72, ORGA72].
Thus if we want to improve the cpu utilization, we must minimize the
number of page faults. Another effect of a page fault is poor memory
utilization. For instance, during the page-wait time and also during
the time the process spends in the ready queue waiting for a cpu, the
page frames of MM allotted to the program are unavailable to other processes.

The  number  of  page  pulls  has  a  direct  relationship  to  the  channel 
traffic.  With  a  given  system,  the  maximum  amount  of  channel  traffic  that 
it can support is fixed. This implies that the number of page pulls incurred
by an individual program is an important measure of its behavior. In
multiprogrammed systems there is an extreme situation, called thrashing, in
which excessive page traffic occurs [DENN68]. Thrashing has the effect
of rapidly decreasing cpu utilization, and it is therefore important to
eliminate it if possible. Note that, if we use demand paging, then the
page fault and the page pull measures become indistinguishable. A related
performance measure is the number of page pushes; however, it is a fixed
percentage of the number of page pulls, on the average.

During the page-wait time, the set of pages allotted to the
interrupted program is wasted. If these pages were allotted to the
currently  executing  program,  we  could  improve  its  performance.   Therefore, 
the  utilization  of  MM  can  be  measured  by  the  space  time  product  of  MM 
used  by  a  program  during  the  entire  time  of  its  residence  in  MM. 
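
As a rough illustration of why page-wait time dominates the space-time product, the following sketch computes it under simplifying assumptions that are not taken from the text: the allotment c is fixed, each reference takes one time unit, each page fault adds a full page fetch time measured in the same units, and time spent in the ready queue is ignored. The function name and parameters are hypothetical.

```python
def space_time_product(num_refs, num_faults, frames, fetch_time):
    """Space-time product for a fixed allotment of `frames` page frames.

    Assumes one time unit per reference and `fetch_time` units of
    page-wait per fault; the frames are held during both.
    """
    residence_time = num_refs + num_faults * fetch_time
    return frames * residence_time


# 1000 references, 10 faults, 4 frames, fetch time of 5000 reference times.
print(space_time_product(1000, 10, 4, 5000))
```

With these figures the ten faults account for 50000 of the 51000 time units of residence, so nearly all of the space-time product is accumulated while the program is waiting for pages.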

2.2  Important  Variables 

There  are  seven  important  variables  in  a  PVMS  which  can  affect 
the  performance.   These  are: 

1.  Whether the system is monoprogrammed or
multiprogrammed.

2.  The number of allocated page frames c.

3.  The page size p.

4.  The average page fetch time T, and the ratio
T/$\Delta$, where $\Delta$ is the memory cycle time.

5.  The memory referencing behavior of the program,
which is reflected in the page reference string $\omega$.

6.  The  organization  of  the  program's  information 
(both  instructions  and  data)  in  the  virtual 
address  space. 

7.  The  paging  algorithm  used. 


There  are  four  controllers  of  these  seven  variables:  hardware, 
operating  system,  compiler  and  the  programmer  [WEIN72]. 

The hardware controls the page size p, the average page fetch
time T, the memory cycle time $\Delta$, and the total amount of MM available.
In some systems it also controls the paging algorithm used [LIPT68].

The  operating  system  controls  the  degree  of  multiprogramming  used, 
the  number  of  allocated  page  frames  c  and  whether  or  not  c  is  allowed  to 
vary  dynamically,  and  finally,  it  determines  the  paging  algorithm  used. 

The  compiler  decides  the  page  reference  string,  at  least  partially, 
insofar  as  the  translation  process  allows  the  freedom  to  do  so.   Storage 
allocation  phase  of  the  compiler  organizes  a  program's  information  in 
the  virtual  address  space. 

The programmer may control the page reference string and, to a
certain extent, the organization of the program's information in the
virtual address space.

The  effect  of  these  seven  variables  on  the  performance  will  be 
considered  now  in  more  detail. 

2.2.1  Multiprogramming 

We  have  observed  that  multiprogramming  is  used  to  overlap  the 
page  transfers  for  one  program  with  the  execution  of  another  program,  and 
as  a  result  it  has  a  potential  of  improving  the  cpu  utilization.  We  note 
that  multiprogramming  also  has  other  uses,  such  as,  to  provide  time 
sharing,  but  in  this  work  we  concern  ourselves  with  the  former  use  only. 
The  number  of  programs  occupying  the  MM  concurrently  is  known  as  the 
degree  of  multiprogramming.   The  basic  questions  in  such  a  system  are 
what  degree  of  multiprogramming  to  use  and  how  much  improvement  in 
cpu utilization is accrued as a result. If a very low degree is used,
we may not be able to find a ready job to run on the occurrence of a
page fault; as a result the cpu will remain idle. On the other hand,
whenever  several  concurrent  programs  share  MM  in  order  to  'mask'  I/O 
time  each  program  gets  a  smaller  number  of  MM  page  frames  which  tends  to 
increase page faulting. In an excellent survey article, Kuck et al.
[KUCK70] discuss this question in great depth. Their conclusion is that
for multiprogramming to yield a reasonable gain in cpu utilization,
there must be a 'sufficient' amount of MM. They have also shown that, with
other variables being fixed, the cpu utilization generally goes up
starting from a degree of multiprogramming equal to one (i.e.,
monoprogramming) until it reaches a maximum, and then it falls very quickly.
If the degree of multiprogramming is very high, the system is susceptible
to thrashing, a collapse of performance that may occur when MM is
overcommitted [DENN68, DENN70]. Denning has proposed that by using a
working set scheduling strategy, thrashing can be avoided. We will not
discuss the question of degree of multiprogramming any further.
Interested readers are referred to [KUCK70, DENN70, COFF73, SMIT67].

2.2.2  Main  Memory  Allotment 

The effect of MM allotment on page faults has been studied in
a number of papers [BELA69, DENN68, DENN68a, KUCK70, LEHM68, OPPE68,
SHEM66, SHEM69, SISS68, SMIT67, FINE66, MATT70, SALT74], and many
experiments have been conducted to determine program paging behavior
[ANAC67, BAER71, BAYL68, BELA66, BRAW68, COFF68, DENN70, FINE66, FREI68,
JOSE70, LEHM68, ONEI67, PINK68, SALT74, STEV68, VARI68]. The conclusion of
all these studies seems to be that the paging rate is very high at low
page allotments and falls very fast as the allotment is increased.
After an amount of page allotment called the 'parachor' [KUEH68] is reached,
the  decrease  in  paging  rate  achieved  by  increasing  c  is  not  very 
significant.   The  value  of  c  at  parachor  is  dependent  on  the  particular 
program  under  consideration.   This  immediately  brings  us  to  the  point  of 
fixed  versus  variable  size  page  allotment. 
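
The dependence of the fault count on the allotment c can be observed with a small simulation. The sketch below counts page faults under demand paging with LRU replacement for several values of c; LRU is chosen here only as a convenient example, and the function name and the short reference string are illustrative assumptions rather than data from the studies cited.

```python
from collections import OrderedDict


def lru_faults(refs, frames):
    """Count page faults for reference string `refs` under LRU
    replacement with a fixed allotment of `frames` page frames."""
    resident = OrderedDict()  # keys ordered from least to most recently used
    faults = 0
    for page in refs:
        if page in resident:
            resident.move_to_end(page)  # mark as most recently used
        else:
            faults += 1
            if len(resident) >= frames:
                resident.popitem(last=False)  # evict least recently used
            resident[page] = True
    return faults


# A string that dwells on pages {1, 2}, then {3, 4}, then {1, 2} again.
refs = [1, 2, 1, 2, 1, 2, 3, 4, 3, 4, 3, 4, 1, 2, 1, 2]
for c in range(1, 5):
    print(c, lru_faults(refs, c))
```

For this string the fault count drops sharply from c = 1 to c = 2, the size of each phase's favored set, and improves little beyond that, which is the qualitative shape described above.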

In  a  monoprogrammed  system,  page  allotment  is  fixed  and  such  a 
question  does  not  arise;  however,  if  we  have  a  multiprogrammed  system 
then control could be exercised. It has been shown that variable memory
allotment is generally superior to fixed memory allotment [BELA69, COFF72,
DENN72].

2.2.3  The  Page  Size 

The effect of page size p on the performance of a PVMS has been
extensively studied in the literature [BAER71, DENN70, DOYL  , GELE71,
GELE73, HATF72, JOSE70, KUCK70, RAND68, RAND69]. There are several
factors which affect the choice of a page size. The operating system
typically rounds a program's storage requirement up to an integral number
of pages, which results in unused words; this waste is called internal
fragmentation [RAND69].
Reducing the page size will reduce internal fragmentation. Kuck [KUCK70]
defines superfluity to be the unreferenced words on a page during the
interval of residence of the page in MM. Doyle [DOYL  ] has called this
gap fragmentation. Reducing the page size tends to minimize this problem.
On the other hand, decreasing the page size increases the space that must
be kept for the page table, which leaves less MM available for program
paging. This effect is
called table fragmentation [DENN70, DOYL  ]. A balance should be achieved
to minimize these fragmentation problems. Note that fragmentation only
affects the memory utilization. There are other effects of the page size.
With a fixed MM size, reducing the page size seems to have a beneficial effect
on the page faults [ANAC67, BAER71, BAYL68, COFF68, JOSE70]. Decreasing
the page size, however, increases cpu overhead, the hardware cost of the
address translation mechanism, the channel traffic and average word fetch
time [DENN70]. We will not study the question of page size in this work.
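
The trade-off between internal and table fragmentation can be made concrete with a small calculation. The sketch below assumes, for illustration only, that a program occupies a single contiguous region of a given number of words and that the page table needs one entry per page; the function name is hypothetical.

```python
def fragmentation(program_words, page_size):
    """Return (pages, unused_words) for a program of `program_words`
    words rounded up to an integral number of pages."""
    pages = -(-program_words // page_size)  # ceiling division
    unused_words = pages * page_size - program_words
    return pages, unused_words


# A 10000-word program with large and small pages.
print(fragmentation(10000, 4096))
print(fragmentation(10000, 512))
```

With 4096-word pages the program needs only 3 page-table entries but wastes 2288 words in its last page; with 512-word pages the waste falls to 240 words while the table grows to 20 entries.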

2.2.4  The  Page-Fetch  Time

The  effect  of  the  average  page  fetch  time  T  on  the  performance  of 
a  PVMS  is  the  next  point  to  be  discussed.   In  a  monoprogrammed  system, 
the  cpu  is  idle  during  this  time,  so  a  large  T  implies  a  very  low  cpu 
utilization  and  also  a  large  turnaround  time.   In  a  multiprogramming 
system,  the  effect  of  a  large  T  is  masked  by  overlapping  page-wait  time 
of  one  program  with  the  execution  of  another.   But,  as  we  have  seen, 
multiprogramming  implies  memory  sharing,  which  in  turn  implies  more  page 
faulting.   If  we  avoid  memory  sharing  and  resort  to  swapping  then  the 
channel  traffic  is  increased. 

Kuck  [KUCK70]  has  concluded  that  for  T  >  6000  units,  there  is  no 
gain  to  be  had  from  multiprogramming.   Here  T  is  measured  in  average 
instruction  time  units  (roughly  microseconds).   Roughly  speaking,  this 
means that a movable-head disc is a very poor paging device. Drums and
fixed-head discs fall below the 6000 mark and are useful. Extended core
storage (ECS) provides T < 1000 and is an excellent paging device. In
fact, with ECS, it has been recommended that the process switch after
a page fault be avoided altogether and the cpu be left idle during the
page-wait time [ORGA72].

Denning [DENN68] reports that the main reason for thrashing can
be traced to a large value of T. For a further discussion see [DENN70].

2.2.5  Program  Behavior 

The  memory  referencing  behavior  of  a  program  is  a  very  important 
variable  in  determining  performance  of  a  PVMS.  A  program  which  scatters 
its  references  over  the  entire  address  space  will  page  fault  heavily.   It 
is  therefore  desirable  that  the  program  tend  to  refer  to  a  small  subset 
of  its  address  space  in  a  small  time  interval.   Fortunately,  the  above 
property,  known  as  the  locality  property,  is  found  to  be  exhibited  by 
most programs in practice [BELA66, COFF73, DENN68, DENN70, DENN72a,
FINE66, LIPT68, BRAW68, BRAW70, MCKE69, KUEH68, SAYR69, SHED72]. In fact,
this property of locality is the key reason for the feasibility of PVMS.

More specifically, the property of locality can be summarized
in three statements [DENN72a]:

During any time interval, a program distributes its
references nonuniformly over its address space, some
pages being favored over the others.

The density of references to a given page changes
slowly in time, or the set of favored pages changes
membership slowly.

Two disjoint segments of the page reference string
tend to be highly correlated when the interval between
them is short, and tend to become uncorrelated as the
interval between them increases.

It is clear that the behavior and style of the programmer has a
direct bearing on the locality of programs. Programmers tend to use
sequential and looping control structures, they tend to concentrate
on small sections of large programs for moderately long time intervals,
and they generally group data into content-related blocks.

The set of favored pages seems naturally to split into four areas
of activity [GIBS66, JOSE70]. One of these areas is constituted by the
instruction  addresses  of  the  program.   The  other  three  areas  are 
constituted  by  data  addresses.   The  processes  of  data  analysis  often 
consist  of  arithmetic  or  logical  manipulations  on  two  or  more  strings  of 
independently  (semi)  coherent  addresses  which  in  combination  form  the 
sequence  of  operand  addresses.   The  fourth  area  of  activity  is  the  output 
catchment  area  where  the  results  are  held  prior  to  outputting.   If  it  is 
not  possible  to  study  the  program  behavior  in  this  great  a  detail,  it 
may  be  profitable  to  separate  the  set  of  favored  pages  into  data  pages 
and instruction pages [DENN72]. We will investigate this topic in detail
in  this  thesis.  We  remark  that  for  problems  with  large  data  bases,  the 
instruction  paging  is  almost  trivial  compared  with  the  data  paging. 
Therefore,  we  will  concentrate,  for  the  most  part,  on  the  data  paging 
problem  in  this  thesis. 

Examples of what may happen to the performance of a program with
poor locality are numerous [FINE66, BRAW68, DENN65, BAIR68, SMIT67].
Therefore,  it  is  very  important  that  methods  be  developed  to  create 
guidelines  on  how  to  write  programs  with  a  high  degree  of  locality. 
Before  we  discuss  these  locality  improvement  methods,  we  discuss  several 
other  results  on  program  behavior. 

To study and predict the performance of programs running under PVMS,
several  models  of  program  behavior  have  been  investigated.   These  models 
are  used  to  generate  reference  strings  to  be  used  in  the  analytical  study 
of  program  behavior.   The  validation  of  a  proposed  model  is  another 
question that needs to be answered. The simplest of all models investigated
is the independent reference model, which has been studied by several
authors [COFF73, KING71, FRMTfk, AHO71, BELA66]. In this model, it is
assumed that the probability of a reference to page $i$ at time $t$ is given
by $P[r_t = i] = p_i$ for all $t \ge 1$ and all $1 \le i \le n$, where the set of all $p_i$ is
fixed and $\sum_{i=1}^{n} p_i = 1$. It has been observed that this model is relatively
simple to analyze but does not mirror the behavior of actual programs
[COFF73]. Aho et al. [AHO71] proposed a very general $\ell$-order nonstationary
model of program behavior. However, except in the 0-order stationary case
(which is identical to the independent reference model), very few results
have been obtained for this general model. Denning et al. [DENN72a] propose
several  locality  models  of  program  behavior.   They  were  able  to  produce 
some  interesting  results  with  these  models  which  compared  favorably  with 
the behavior of practical programs. Mattson et al. [MATT70] propose
an LRU stack model of program behavior, which was improved and modified
by Shedler et al. [SHED72]. In the LRU stack model, it is assumed that
the locality L of the program at time t is of a fixed size c, equal to
the page allotment, and consists of the last c distinct pages referenced by
the program. Quite clearly, this is a good model. Saltzer [SALT74] has
recently  proposed  a  simple  linear  model  of  program  behavior  which  is 
validated  against  the  MULTICS  system.   This  model  assumes  that  the  page 
fault  probability  is  inversely  proportional  to  the  page  allotment  c. 
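
To show how a model of this kind is used to generate reference strings, the following sketch draws a string from the independent reference model, in which each reference is to page i with a fixed probability regardless of the past. The function name, the seed parameter and the example probabilities are assumptions made for illustration.

```python
import random


def irm_reference_string(probs, length, seed=0):
    """Generate `length` page references from the independent
    reference model; probs[i-1] is the probability of page i."""
    if abs(sum(probs) - 1.0) > 1e-9:
        raise ValueError("page probabilities must sum to 1")
    rng = random.Random(seed)  # seeded so that runs are repeatable
    pages = list(range(1, len(probs) + 1))
    return rng.choices(pages, weights=probs, k=length)


# Four pages, with page 1 strongly favored.
refs = irm_reference_string([0.7, 0.1, 0.1, 0.1], 20)
print(refs)
```

Because every draw is independent of the previous ones, a string produced this way favors some pages but shows no phases or slow drift in the favored set, which is one way to see why the model does not mirror actual programs.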

Several  authors  have  given  guidelines  to  produce  programs  with 
better locality [BRAW68, DENN70, FINE66, HATF71, KUEH68, WEIN72]. The
use  of  modular  and  highly  structured  programming  and  programming  languages 
is the suggestion made most often. The improvement of locality by proper
arrangement of code and data in the virtual address (VA) space is a problem
we will discuss separately. Here we are concerned with the logical
organization of the program. When asked to solve a particular problem, a
programmer generally
has  a  choice  of  a  variety  of  algorithms  to  use.  As  an  example,  if  he  is 
to  solve  a  system  of  linear  equations,  he  can  use  Gaussian  elimination, 
LU  decomposition  or  Cholesky  factorization  and  many  others  [ISAA66].   The 
choice  of  the  algorithm  is  based  upon  the  mathematical  properties  of  data; 
therefore,  we  expect  the  locality  to  vary  with  each  algorithm.   Even  when 
a  particular  algorithm  is  chosen,  it  may  be  possible  to  change  the  order 
of  some  operations  without  affecting  the  algorithm  drastically.  We  would 
like  to  consider  the  order  of  operations  which  yields  the  maximum  locality. 
In  order  to  discuss  this  topic  in  detail,  we  have  to  consider  individual 
applications. For applications involving sorting, see [BRAW70]; for the
searching problem, see [KNIG  ]; and for the list processing application,
see [BOBR67]. In the case of matrix algorithms, McKellar et al. [MCKE69]
have shown that block algorithms have a superior locality property
compared with nonblock algorithms. Dubrulle [DUBR72] has discussed the
solution of the eigenvalue problem in a paged environment. Moler [MOLE72]
has discussed a method of loop reversal for improving matrix algorithms.
Rogers [ROGE73] discusses the solution of linear equations in a paged
environment. We will discuss methods of improving the locality of matrix
algorithms. Our work will be distinguished from that of the above by the
fact that we will be studying many more matrix algorithms and that our
measurements will be more extensive. In particular, we will study matrix
multiplication, Cholesky decomposition, LU decomposition, Gaussian
elimination and Gram-Schmidt orthonormalization. We note here that,
unlike Rogers, we choose to ignore the error growth aspect of our algorithms.
We will measure the paging performance of our algorithms under a variety
of paging algorithms, both realizable and unrealizable. In contrast,
McKellar et al. have only measured the performance of their algorithms
under the MIN [BELA66] paging algorithm, which is unrealizable.

So  far  we  have  discussed  only  programmer-implemented  locality 
improvement  methods.   One  of  the  motivations  for  adopting  a  virtual  memory 
system  is  that  it  relieves  the  programmer  from  the  burden  of  memory 
management  (or  overlay  problems)  when  his  program  cannot  fit  into  the 
available MM [DENN70, KILB62]; therefore, asking the programmer to worry
about  locality  of  his  programs  seems  to  be  a  step  backwards.   The  job 
of  improving  locality  of  programs  must  then  be  carried  out  by  the  computer 
system.   The  operating  system  cannot  possibly  do  much,  since  it  does  not 
know  the  global  structure  of  the  program.   Therefore,  it  appears  plausible 
that  the  optimization  we  seek  can  be  achieved  by  the  compiler.  We  know 
of  no  earlier  work  along  these  lines.   Observe  that,  we  are  considering 
the  topic  of  logical  program  reorganization  and  not  physical  reorganization. 
We  will  consider  this  type  of  compiler  optimization  for  matrix  algorithms 
written in both a high-level language (FORTRAN or PL/I) and an array
language  OL/2  [PHIL72].  We  remark  that  the  OL/2  language  is  an  array 
language  which  allows  dynamic  partitioning  and  block  structure,  but  it 
does  not  allow  a  'GOTO'  statement.   These  characteristics  greatly  influence 
the program reorganization strategies.

2.2.6  Pagination

The organization of a program's information in the VA space has
been termed pagination [DENN70]. This can be further divided into code
(instruction) pagination and data pagination. In code pagination, at the
highest level one could study the grouping of subroutines to minimize page
faults. Such experiments were made by Comeau [COME67]. Code pagination
can be done on the basis of statistical properties (e.g., branch
probabilities) [FAMA66, DENN70], the history collected from previous
runs [HATF71, FERR73], or solely on the basis of syntax [YELO71]. The
problem of data pagination has received very little attention;
investigations have been made only for special applications such as
matrix algorithms. It has been noted by several authors [MCKE69,
ROGE73] that storing matrices by square blocks, coupled with the use of
block algorithms, seems to improve paging performance. We will assume
block (or submatrix) storage for matrices.

2.2.7  Paging  Algorithms 

We  now  discuss  paging  algorithms  and  their  effect  on  paging 
performance.  A  paging  algorithm  implements  three  policies.   The  fetch 
policy  determines  when  a  particular  page  should  be  brought  into  MM;  a 
replacement  policy  determines  when  a  particular  page  is  removed  from  MM; 
and  a  placement  policy  determines  an  available  page  frame  to  hold  a 
fetched page. In a PVMS (as against a segmented VMS), the placement
policy is trivial and therefore will not be considered [DENN70].

Basically  there  are  two  types  of  fetch  policies,  namely,  demand 
fetching  and  anticipatory  fetching.   Under  demand  fetching,  a  page  is 
brought into MM only on demand, i.e., when a page is referenced and is not
found  to  be  resident  in  MM  (i.e.,  at  the  time  of  a  page  fault).   Under 
anticipatory  fetch  policy  (also  called  prefetching),  one  or  more  pages  may 
be  brought  into  MM  at  any  time,  usually  in  advance  of  reference  to  the 
page.  A  special  kind  of  prefetching  policy  is  called  demand  prefetching 
[COFF73].  Assume  that  at  time  t,  a  page  x  is  referenced  and  is  not  found 
in  MM,  thus  creating  a  page  fault.  As  a  result  a  page  fetch  for  page  x 
will  be  initiated.  A  demand  prefetch  policy  will  allow  us  to  fetch  some 
other  pages  together  with  page  x  in  the  above  situation. 

A replacement policy can also be one of two types: demand and
anticipatory.

The most popular paging algorithms are of the demand_fetch, demand_replace
variety and are commonly known as demand paging algorithms. All the other
varieties of paging algorithms are commonly known as prepaging algorithms.
Demand paging has been widely used in practice because it is simple to
implement. Prepaging, on the other hand, is more difficult to
implement, chiefly because good predictions of the pages a program will
need in the future are not easily obtained [DENN70, DENN72].

Two types of prepaging schemes have been reported. One scheme,
known as swapping, has been used to guard against excessive paging due to
a small time quantum (e.g., in time-sharing systems) [DENN70, KUEH68, ORGA72].
In this scheme (except for the first time quantum) a program's working
set is preloaded at the beginning of the time quantum and demand paging
is used during the time quantum. We note that the working set to be
preloaded is determined from the set of pages the program had acquired
during the last time quantum. This scheme is successful when the time
quantum is small.
The second scheme, proposed by Joseph [JOSE70], is as follows:
whenever a page fault occurs for page x, fetch both page x and page x+1.
This scheme is particularly successful when the address pattern of a
program is sequential.

Prepaging incurs fewer page faults than demand paging, but the
number of page pulls may be the same or even larger because of the
possibility of an incorrect prediction [JOSE70]. The effect of prepaging
on the ST product of MM used is not clear from the literature. Joseph reports
an increase in ST product, whereas Denning [DENN70] shows that if the
probability of incorrect prediction is low, then the ST product is reduced
by prepaging. The ST product tends to fall because total page wait time
is reduced, but at the same time it tends to rise because pages
are brought into MM in advance of their use. We will discuss this question
in Chapter 3. If more than one page is brought into MM at the same time
and a proper layout of pages on the secondary memory is used, then the
effective access time per page (t) can be reduced, provided rotating
secondary memory is used. The basic question, then, is which
performance measure is more important: page faults or page pulls. There is
no easy answer to this question. However, if the goal is better cpu
utilization, then the page fault count is the more important measure, and
if the goal is reduction of channel traffic, then the page pull count is
the more important measure. It seems, therefore, that prepaging is
justifiable if good predictions are available. We will elaborate on these
ideas in subsequent chapters.
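
The trade-off between page faults and page pulls can be illustrated with a
small simulation. The following Python sketch is only an illustration under
stated assumptions: pages are numbered by integers, LRU replacement is
assumed for both schemes (the text does not fix a replacement rule for
Joseph's scheme), and the function name `simulate` and the two sample
reference strings are hypothetical.

```python
from collections import OrderedDict


def simulate(refs, frames, lookahead):
    """Return (page_faults, page_pulls) for an integer reference string.

    Assumption: LRU replacement. With lookahead=False this is plain demand
    paging; with lookahead=True a fault on page x also fetches page x+1.
    """
    resident = OrderedDict()  # keys ordered from least to most recently used
    faults = pulls = 0
    for x in refs:
        if x in resident:
            resident.move_to_end(x)  # a hit only refreshes recency
            continue
        faults += 1
        # Fetch x+1 first so that x itself ends up most recently used.
        wanted = [x + 1, x] if lookahead and frames > 1 else [x]
        for page in wanted:
            if page in resident:
                continue  # predicted page is already in memory
            if len(resident) >= frames:
                resident.popitem(last=False)  # evict the LRU page
            resident[page] = True
            pulls += 1
    return faults, pulls


sequential = list(range(8))
scattered = [0, 5, 0, 5, 2, 7]
for name, refs in (("sequential", sequential), ("scattered", scattered)):
    print(name, simulate(refs, 4, False), simulate(refs, 4, True))
```

On the sequential string the lookahead halves the faults at the same number
of pulls; on the scattered string the faults are unchanged while the pulls
double, since every predicted page x+1 goes unused.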

2.3  The  Formalism  of  Paging  Algorithms 

In the abstract, a paging algorithm A is a mechanism for processing
a reference string $\omega = r_1 r_2 \ldots r_t \ldots$ and generating, in response, a
sequence of memory states $S_0 S_1 \ldots S_t \ldots$, where $S_0$ is a specified
initial memory state [AHO71, COFF73]. Each memory state $S_t$ is the set
of pages from N which reside in M at time t; they satisfy the conditions
$S_t \subseteq N$, $|S_t| \le c$, $r_t \in S_t$ $(t > 0)$. Moreover, $S_t$ and $S_{t-1}$ are related by
$S_t = S_{t-1} + X_t - Y_t$, where $X_t \subseteq N - S_{t-1}$ is the set of pages fetched at time
t and $Y_t \subseteq S_{t-1}$ is the set of replaced pages. To determine $X_t$ and $Y_t$ at time
t, the paging algorithm must maintain internal records. A set of control
states Q is used for this purpose, $q_0$ being a designated initial control
state. A configuration of the paging algorithm is any pair (S, q) in which
$|S| \le c$, $q \in Q$. Associated with the algorithm A is a transition function $g_A$
such that $g_A(S, q, x) = (S', q')$, $x \in S'$, where (S, q) is the present
configuration, (S', q') is the next configuration, and x is the current
page reference causing the transition. In particular, the memory state
sequence $S_0 S_1 \ldots S_t \ldots$ induces a configuration sequence
$(S_0, q_0)(S_1, q_1) \ldots (S_t, q_t) \ldots$ generated by

\[ (S_t, q_t) = g_A(S_{t-1}, q_{t-1}, r_t), \qquad t \ge 1. \]

A paging algorithm is a demand fetching algorithm if, for a given c > 0,
$|X_t| \in \{0, 1\}$ and if $r_t \in S_{t-1}$ then $X_t = \emptyset$. A paging algorithm is a demand
replacement algorithm if $|Y_t| \in \{0, 1\}$ and if $r_t \in S_{t-1}$ or $|S_{t-1}| < c$
then $Y_t = \emptyset$. A paging algorithm is a demand paging algorithm
(demand_fetch_demand_replace) if $g_A(S, q, x) = (S', q')$ satisfies the
conditions:

\[ S' = \begin{cases}
S & \text{if } x \in S \\
S + x & \text{if } x \notin S \text{ and } |S| < c \\
S + x - y, \ \exists\, y \in S, & \text{if } x \notin S,\ |S| = c.
\end{cases} \]

Thus for a demand paging algorithm, $0 \le |Y_t| \le |X_t| \le 1$. A paging
algorithm is a demand prefetching algorithm if $g_A(S, q, x) = (S', q')$
satisfies the condition that if $x \in S$ then $X = \emptyset$. In particular, the
restriction $|X_t| \le 1$ is removed. And finally, a paging algorithm is a
demand prepaging algorithm [COFF73] if $g_A(S, q, x) = (S', q')$ satisfies the
conditions:

\[ S' = \begin{cases}
S & \text{if } x \in S \\
S + X & \text{if } x \notin S,\ x \in X,\ |S + X| \le c \\
S + X - Y & \text{if } x \notin S,\ x \in X,\ |S + X - Y| = c.
\end{cases} \]

In a reference string $\omega$, the forward distance $d_t(x)$ at time t
for page x is the distance to the first reference to x after time t:

\[ d_t(x) = \begin{cases}
k, & \text{if } r_{t+k} \text{ is the first occurrence of } x \text{ in } r_{t+1}, r_{t+2}, \ldots \\
\infty, & \text{if } x \text{ does not appear in } r_{t+1}, r_{t+2}, \ldots
\end{cases} \]

Similarly, the backward distance $b_t(x)$ is the distance to the most
recent reference to x:

\[ b_t(x) = \begin{cases}
k, & \text{if } r_{t-k} \text{ is the last occurrence of } x \text{ in } r_1, r_2, \ldots, r_t \\
\infty, & \text{if } x \text{ does not appear in } r_1, r_2, \ldots, r_t
\end{cases} \]

Presently, we consider paging algorithms which work with a fixed
memory allotment (i.e., a fixed value of c). Some typical demand paging
algorithms will now be described. Let $S = S_t$ and let
$S' = S_{t+1} = S + x - R(S, q, x)$, such that $x \notin S_t$, $x \in S_{t+1}$, and $R(S, q, x)$
denotes the page selected for replacement, so that $R(S, q, x) = \emptyset$ if $|S| < c$.
(1)  LRU:   This algorithm chooses the page least recently used
            as the page to be replaced. In other words,
            $R(S, q, x) = y$ iff $b_t(y) = \max_{z \in S} [b_t(z)]$.
            This algorithm is used widely in practice and is
            found to behave fairly well [DENN70, CORB69, LIPT68].

(2)  FIFO:  This algorithm chooses for replacement the page
            that was fetched first in MM. This algorithm
            behaves well only for highly sequential reference
            patterns. However, it is very simple to
            implement.

(3)  RAND:  It selects the page to be replaced randomly from the
            pages resident in MM. Clearly, this algorithm
            behaves well only for the independent reference model
            of program behavior [BELA66, KING71, COFF73]. It
            is also very simple to implement.

(4)  MIN:   The page to be replaced has the largest forward
            distance. In case of a tie, the tie-breaking rule
            uses lexicographic ordering. This algorithm was
            first proposed by Belady [BELA66], and it is
            unrealizable since it presumes advance knowledge
            of the program's reference string. It is, however,
            useful for theoretical purposes since it can be
            proved optimal in a certain sense. It is also
            sometimes called $B_0$ [COFF73] and OPT [MATT70].

            More specifically [COFF73], $R(S, q, x) = y$ iff
            $y = \min_{z \in S^*} [z]$. Here, $S^*$ is defined so that
            $i \in S^*$ iff $d_t(i) = \max_{u \in S} [d_t(u)]$.

            Note here that lexicographical ordering is used in $S^*$.
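
The replacement rules above can be sketched as a small demand paging
simulator. This Python sketch is an illustration, not the measurement
program used in this work: the names `forward_distance` and `count_faults`
are hypothetical, RAND is left out so that the result is deterministic,
and the sample string is the one used as an example in section 2.5.

```python
def forward_distance(refs, t, page):
    """Distance from index t to the next reference to page (inf if none)."""
    for k, r in enumerate(refs[t + 1:], start=1):
        if r == page:
            return k
    return float("inf")


def count_faults(refs, c, policy):
    """Count page faults of a demand paging algorithm with c page frames."""
    memory = []     # resident pages
    loaded_at = {}  # time each resident page was fetched (used by FIFO)
    last_used = {}  # time of the latest reference to each page (used by LRU)
    faults = 0
    for t, x in enumerate(refs):
        if x not in memory:
            faults += 1
            if len(memory) == c:
                if policy == "LRU":
                    # largest backward distance = smallest last-use time
                    victim = min(memory, key=lambda p: last_used[p])
                elif policy == "FIFO":
                    victim = min(memory, key=lambda p: loaded_at[p])
                elif policy == "MIN":
                    # largest forward distance; ties broken by page name
                    victim = min(
                        memory,
                        key=lambda p: (-forward_distance(refs, t, p), p),
                    )
                else:
                    raise ValueError("unknown policy: " + policy)
                memory.remove(victim)
            memory.append(x)
            loaded_at[x] = t
        last_used[x] = t
    return faults


refs = "ABCDEBCBDAEAC"
for policy in ("LRU", "FIFO", "MIN"):
    print(policy, count_faults(refs, 3, policy))
```

Because MIN needs `forward_distance`, it must see the whole reference
string in advance, which is exactly why it is unrealizable in practice.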


2.3.1  Cost  of  a  Paging  Algorithm 

Aho et al [AHO71] define the cost of processing a reference string
$\omega = r_1, r_2, \ldots, r_\gamma$ with a paging algorithm A operating with c page
frames of MM:

\[ C(A, c, \omega) = \sum_{t=1}^{\gamma} h(|X_t|). \]

Here h is a function such that $h(0) = 0$, $h(k) \ge h(1) = 1$. Aho et al prove
the following theorem:

Suppose $h(k) \ge k$; then for any given paging algorithm A, there
exists a demand paging algorithm A' such that $C(A', c, \omega) \le C(A, c, \omega)$
for any reference string $\omega$ and for any value of c.

We note that for rotating auxiliary memory devices like discs
and drums $h(k) < k$. Only for a large core storage device can we hope to
have $h(k) = k$. In such a case, the above cost function measures the
number of page pulls.
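
As a hedged illustration of how the choice of h changes what the cost
function measures, the following Python sketch evaluates C for one sequence
of fetch-set sizes $|X_t|$. Everything named here is an assumption made for
the example: the sample sequence `fetch_sizes`, the function names, and in
particular `h_rotating`, which is just one hypothetical function with
$h(k) < k$ for $k > 1$ (the first page costs 1, each additional page a
fraction `extra`).

```python
def cost(fetch_sizes, h):
    """C = sum over t of h(|X_t|), given the sizes |X_t| of the fetch sets."""
    return sum(h(k) for k in fetch_sizes)


def h_pulls(k):
    # h(k) = k: the cost is the number of page pulls
    return k


def h_faults(k):
    # h(k) = 1 for k >= 1: the cost is the number of fetch operations
    return 1 if k >= 1 else 0


def h_rotating(k, extra=0.25):
    # Hypothetical h(k) < k for k > 1: extra pages ride along cheaply
    return 0 if k == 0 else 1 + extra * (k - 1)


# Hypothetical |X_t| sequence: four fetch operations moving eight pages.
fetch_sizes = [3, 0, 0, 2, 0, 0, 1, 0, 0, 2, 0, 0, 0]
print(cost(fetch_sizes, h_pulls))
print(cost(fetch_sizes, h_faults))
print(cost(fetch_sizes, h_rotating))
```

The same paging behavior is thus charged 8, 4, or an intermediate amount,
depending on which device model h is taken to represent.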

A paging algorithm A is said to be optimal (with respect to the
cost function defined) if it minimizes $C(A, c, \omega)$ for all reference
strings $\omega$ and all values of c. It can be proved that MIN is an optimal
paging algorithm by the above definition of optimality [COFF73, POME71,
MATT70]. Therefore, the MIN algorithm, though unrealizable, is useful as
a benchmark for evaluating other paging algorithms.

Assume that a probability distribution is specified over $N^+$ such
that for $\omega \in N^+$, $p(\omega)$ is the probability of occurrence of $\omega$; then we define
the expected cost of a paging algorithm A by:

\[ C(A, c) = \sum_{\omega} p(\omega)\, C(A, c, \omega) \qquad \text{[COFF73]}. \]

Define a paging algorithm to be optimal with respect to distribution p if
it minimizes $C(A, c)$, $\forall c \ge 1$. Denning et al [DENN68b] proposed a demand
paging algorithm $A_0$, which replaces the page with the longest expected
time until next reference. It can be proved that $A_0$ is an optimal paging
algorithm with respect to a distribution corresponding to the independent
reference model of program behavior [AHO71].


2.4  Working Set Algorithm

We consider Denning's working set algorithm, which is a variable
memory algorithm [DENN68a]. A program's working set at time t is defined
to be

\[ W(t, T) = \{\, i \in N \mid \text{page } i \text{ appears among } r_{t-T+1}, \ldots, r_t \,\} \]

where T is a parameter called the window size and $T \ge 1$. In other words,
$W(t, T)$ is the 'contents' of a window of size T looking backwards at the
reference string from reference $r_t$. The working set replacement algorithm
essentially states: do not replace a page from the working set. This
algorithm is a little difficult to implement but behaves very well in
practice [DOHE70, RODR72, RODR73, WEIZ69]. The working set principle consists
in using a working set replacement algorithm together with the following
scheduling policy: a program may run if its working set is in MM [DENN70].
Denning has shown that thrashing can be avoided by use of the WS principle
[DEJM68].

Denning et al [DENN72b] define and prove several important
properties of the working-set model. Define the working set size
$w(t, T) = |W(t, T)|$ and the average working set size

\[ s(T) = \lim_{k \to \infty} s_k(T), \qquad \text{where } s_k(T) = \frac{1}{k} \sum_{t=1}^{k} w(t, T). \]

Define the binary variable

\[ \Delta(t, T) = \begin{cases} 1 & \text{if } r_{t+1} \notin W(t, T) \\ 0 & \text{otherwise;} \end{cases} \]

then the missing page rate (page fault probability) is

\[ m(T) = \lim_{k \to \infty} \frac{1}{k} \sum_{t=0}^{k-1} \Delta(t, T). \]

Suppose that in the reference string $\omega$, two successive references to page i
occur at times t and $t + x_i$. We call $x_i$ an interreference interval for page i.
The interreference distribution for page i is defined to be

\[ F_i(x) = \lim_{k \to \infty} \frac{\text{no. of } x_i \text{ in } r_1, r_2, \ldots, r_k \text{ with } x_i \le x}{\text{no. of } x_i \text{ in } r_1, r_2, \ldots, r_k}. \]

The interreference density function for page i is defined to be
$f_i(x) = F_i(x) - F_i(x-1)$. With these definitions the following properties hold:
P1:  $1 = s(1) \le s(T) \le s(T+1) \le \min\{n, T+1\}$.

P2:  $s(T+1) - s(T) = m(T)$.

P3:  $0 \le m(T+1) \le m(T) \le m(0) = 1$.

P4:  $m(T) = 1 - F(T) = \sum_{y > T} f(y)$.

P5:  $m(T+1) - m(T) = -f(T+1)$.

P6:  $s(T) = \sum_{z=0}^{T-1} m(z) = \sum_{z=0}^{T-1} (1 - F(z)) = \sum_{z=0}^{T-1} \sum_{y > z} f(y)$.

P7:  $\lim_{T \to \infty} s(T) = n$.

P9:  $\lim_{T \to \infty} m(T) = 0$.

We note here that the average working set size of a program is a
very good indicator of its locality [DENN70].
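
The working set quantities can be estimated on a finite reference string.
The Python sketch below is an illustration only: the function names are
hypothetical, the limits in the definitions of s(T) and m(T) are replaced
by averages over a finite string (so properties such as P2 hold only
approximately), and the window is truncated at the start of the string.

```python
def working_set(refs, t, T):
    """W(t, T): pages among r_{t-T+1}, ..., r_t (times are 1-based)."""
    start = max(1, t - T + 1)
    return set(refs[start - 1:t])


def average_ws_size(refs, T):
    """Finite-string estimate of s(T): mean of w(t, T) over t = 1..k."""
    k = len(refs)
    return sum(len(working_set(refs, t, T)) for t in range(1, k + 1)) / k


def missing_page_rate(refs, T):
    """Finite-string estimate of m(T): fraction of t in 0..k-1 for which
    r_{t+1} is not in W(t, T)."""
    k = len(refs)
    misses = sum(
        1 for t in range(k) if refs[t] not in working_set(refs, t, T)
    )
    return misses / k


refs = "ABCDEBCBDAEAC"
for T in range(0, 6):
    print(T, average_ws_size(refs, T), missing_page_rate(refs, T))
```

Since W(t, T) is contained in W(t, T+1), the estimates printed for m(T)
never increase with T and those for s(T) never decrease, in line with P1
and P3.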

2.5  Performance  Measurement 

If we are given a reference string $\omega$, a page allotment c and a
paging algorithm A, then we can simulate the processing of $\omega$ by A and obtain
the page fault count and the page pull count. We can also measure the ST
product provided the average page wait time is assumed to be fixed. With
a demand paging algorithm, page faults and page pulls are identical and
the ST product is directly dependent on page faults; therefore, most
measurements are restricted to page pulls only. As an example, we
consider the reference string $\omega$ = ABCDEBCBDAEAC, where N = {A, B, C, D, E}.
We use c = 3 and find the page faults for the above $\omega$ using LRU.

The number of page faults in this example is 11. Observe how the LRU
algorithm chooses a page to be replaced. At t = 4, $r_4 = D$ and $r_4 \notin S_3$, which
implies a page fault, which in turn implies a replacement. In $S_3$, A is the
page used least recently; therefore, it is chosen for replacement. Now
if, for the same $\omega$ and paging algorithm but for a different value of c, we
are asked to find the page fault count, we will have to go through the
simulation again. This means a rescan of the reference string for each
new value of c. Since practical reference strings are very long, say a
million references, the simulation is very inefficient. Mattson et al
[MATT70] have developed a technique which scans the reference string only
once and produces the page fault count for all values of $1 \le c \le n$. We
will have occasion to use their ideas in Chapter 3.

2.5.1  Stack Algorithms [COFF73, MATT70]

We introduce the notation $S(A, c, \omega)$ to stand for the memory
state resulting after the paging algorithm A has processed reference string
$\omega$ in a memory of size c, assuming that $S_0 = \emptyset$. That is, if $\omega = r_1, r_2,
\ldots, r_t$ generates $S_0, S_1, \ldots, S_t$ under A, then $S(A, c, \omega) = S_t$. If A
is understood, we simply write $S(c, \omega)$.

An algorithm A is called a stack algorithm if its memory states
satisfy the following inclusion property: $S(c, \omega) \subseteq S(c+1, \omega)$ $\forall c \ge 1$,
$\forall \omega \in N^*$. This means that the memory states form a collection of nested
sets. This condition is equivalent to the following condition: for each
$\omega$, there exists a permutation of N, $\mathbf{s}(\omega) = (s_1(\omega), s_2(\omega), \ldots, s_n(\omega))$,
such that, for all $1 \le c \le n$, $S(c, \omega) = \{s_1(\omega), s_2(\omega), \ldots, s_c(\omega)\}$. The
vector $\mathbf{s}(\omega)$ is called the stack vector or simply the stack. For a given
stack algorithm, the inclusion property implies that for each reference
string $r_1, \ldots, r_\gamma$, a sequence of stacks $\mathbf{s}_1, \mathbf{s}_2, \ldots, \mathbf{s}_\gamma$ can be constructed
so that the memory state sequence for each value of c can be determined by
simply taking the topmost c pages of the stack. This property of stack
algorithms implies that the page fault behavior of a given reference
string can be computed effectively in parallel for all memory sizes
$1 \le c \le n$ and in one scan of the reference string.

Examples of stack algorithms are MIN, LRU, and LFU (least frequently
used), whereas FIFO is not a stack algorithm [COFF73, MATT70].

The stack distance $D_x(\omega)$ is defined for page x to be the position
of x in the stack $\mathbf{s}(\omega)$. Thus if $s_k(\omega) = x$, then $D_x(\omega) = k$. If x does
not appear in $\mathbf{s}(\omega)$, then $D_x(\omega) = \infty$. Observe that a page fault occurs
for page x as the last reference in the string $\omega x$ iff $D_x(\omega) > c$, since
the first c elements of $\mathbf{s}(\omega)$ are the contents of memory of size c.
Suppose that $\omega = r_1, r_2, \ldots, r_\gamma$ is processed by A, producing the stack
sequence $\mathbf{s}_1, \mathbf{s}_2, \ldots, \mathbf{s}_t, \ldots, \mathbf{s}_\gamma$. There is, associated with $\omega$, a stack
distance sequence $D_1, D_2, \ldots, D_\gamma$, where $D_t$ is the position of $r_t$ in
$\mathbf{s}_{t-1}$. If $\pi(A, c, \omega)$ denotes the number of page faults in processing $\omega$
with memory size c using algorithm A, then

\[ \pi(A, c, \omega) = |\{\, t \mid D_t > c,\ 1 \le t \le \gamma \,\}|. \]

Define $a_k = |\{\, t \mid D_t = k,\ 1 \le t \le \gamma \,\}|$; then clearly,

\[ \pi(A, c, \omega) = a_\infty + \sum_{k=c+1}^{n} a_k. \]

Alternatively, the success function $N(A, c, \omega)$ is $N(A, c, \omega) = \sum_{k=1}^{c} a_k$,
the success frequency is $F(A, c, \omega) = N(A, c, \omega)/\gamma$, and the page fault
frequency is $m(A, c, \omega) = \pi(A, c, \omega)/\gamma$.
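
The one-scan computation can be sketched for LRU, whose stack is simply the
list of pages ordered by recency of use. The following Python sketch is an
illustration under that assumption; the function name `lru_fault_counts`
is hypothetical, and the sample string is the one from section 2.5.

```python
def lru_fault_counts(refs, n):
    """One scan of refs; returns {c: page faults under LRU} for 1 <= c <= n.

    The LRU stack is kept with the most recently used page on top
    (index 0), so the stack distance D_t of a reference is its 1-based
    position in the stack just before the reference is processed.
    """
    stack = []
    a = {}     # a[k] = number of references with stack distance k
    a_inf = 0  # number of references with infinite stack distance
    for x in refs:
        if x in stack:
            k = stack.index(x) + 1
            a[k] = a.get(k, 0) + 1
            stack.remove(x)
        else:
            a_inf += 1
        stack.insert(0, x)  # the referenced page moves to the top
    # faults(c) = a_inf + sum of a_k for k = c+1 .. n
    return {
        c: a_inf + sum(a.get(k, 0) for k in range(c + 1, n + 1))
        for c in range(1, n + 1)
    }


print(lru_fault_counts("ABCDEBCBDAEAC", 5))
```

The entry for c = 3 reproduces the 11 page faults found by direct
simulation in section 2.5, while the counts for all other memory sizes come
from the same single scan.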

Gordon [GORD73] has shown a method by which one can compute the
average working set size and the page fault probability for the WS
algorithm and for the LRU algorithm in one pass of the reference string.
Define $I_x(\omega)$ (reference interval) associated with each page in the stack
$\mathbf{s}(\omega)$ as the number of distinct pages referenced since the last reference
to this page, with $I_x(\omega x) = 1$ (i.e., the reference interval is defined to be
equal to 1 at the time this page is referenced). If $x \notin \mathbf{s}(\omega)$ then
$I_x(\omega) = \infty$. Let $I_t$ be the reference interval associated with page $r_t = x$
in the stack $\mathbf{s}_{t-1}$. Then associated with the reference string $\omega$, there is
a reference interval sequence $I_1, I_2, \ldots, I_t, \ldots, I_\gamma$. Clearly
$\pi(WS, T, \omega) = |\{\, t \mid I_t > T,\ 1 \le t \le \gamma \,\}|$. If we define
$b_k = |\{\, t \mid I_t = k,\ 1 \le t \le \gamma \,\}|$, then

\[ \pi(WS, T, \omega) = b_\infty + \sum_{k \ge T+1} b_k. \]

Also $m(T) = \pi(WS, T, \omega)/\gamma$ and $s(T) = \sum_{z=0}^{T-1} m(z)$, where $m(0) = 1$ by
definition.

We have noted that MIN is a stack algorithm, but since it requires
a scan of the future reference string in order to determine the page to
be replaced, a one pass algorithm is difficult to obtain. Mattson et al
[MATT70] have given a two pass algorithm to get MIN statistics. More
recently, Belady et al [BELA74] have presented a one pass algorithm to obtain
MIN page fault behavior.

So  far  we  have  discussed  methods  to  obtain  paging  performance  when 
the  reference  string  is  specified.  Collection  of  page  traces,  however,  is 
very  time  consuming  and  for  very  long  page  traces  running  simulations  to 
obtain  paging  performance  is  also  very  time  consuming.   It  is  beneficial, 
therefore,  to  consider  analytical  methods  to  predict  paging  behavior  of 
programs.   This  requires  a  model  of  program  behavior  to  be  formulated. 
King  [KING71]  has  analyzed  program  behavior  using  the  independent  reference 
model. 

2.5.2  King's  Model 

If the configuration sequence $(S_0, q_0)(S_1, q_1) \ldots (S_t, q_t) \ldots$
generated by a paging algorithm forms a Markov chain, then analytical
methods exist to find m(c), where

\[ m(c) = \sum_{\omega \in N^\infty} p(\omega)\, m(c, \omega). \]

Here $N^\infty$ is the set of all infinite reference strings over N.

Suppose we are given the transition probability matrix P of the
Markov chain whose states are $\mathcal{V} \times Q$, where $\mathcal{V} = \{\, S \mid S \subseteq N,\ |S| = c \,\}$
(i.e., $\mathcal{V}$ is the set of memory states). Now if the Markov chain is
irreducible, then there exists a unique stationary probability distribution
a such that $aP = a$ and $\sum_{i \in \mathcal{V} \times Q} a_i = 1$. Define the equivalence class of
state i of the Markov chain by:

\[ [i] = \{\, j \mid \text{the transitions } i \to j \text{ and } j \to i \text{ are non-page-fault transitions} \,\}. \]

It can be shown that

\[ m(c) = 1 - \sum_{j=1}^{|\mathcal{V} \times Q|} \Big( \sum_{\{k \mid [k] = [j]\}} a_k\, p_{k,j} \Big) \]

where $P = (p_{k,j})$. Under the independent reference model of program behavior,
the LRU, FIFO, and $A_0$ algorithms satisfy the criteria for this treatment, and
closed form expressions for the page fault probability can be obtained.

3.  GENERAL PREPAGING ALGORITHMS

3.0  Introduction 

In  this  chapter  we  present  arguments  as  to  how  prepaging  can  be 
useful.  We  present  several  new  prepaging  algorithms  and  study  their 
behavior. 

We have noted in Chapter 2 that Aho et al proved that, given an
arbitrary paging algorithm A, there exists a demand paging algorithm A'
which performs no worse than A. First of all, this theorem holds only if
$h(k) \ge k$. For rotating auxiliary memories $h(k) < k$, and only for bulk
core storage do we have $h(k) = k$. Since rotating auxiliary memories
are in widespread use, the assumptions of the theorem do not hold in
practice. If $h(k) = k$ then the cost function defined by Aho et al measures
the number of page pulls. If the channel is the bottleneck in the system
then the page traffic should be minimized. Heuristically, minimization
of page pulls for individual programs will imply minimization of the
system page traffic. Under this assumption the theorem is valid. When
cpu overhead is critical [MASU72, ORGA72], however, and we have a
multiprogramming system, a page fault implies a process switch, in which
case how many pages we pull after a page fault is immaterial, and
therefore the cost of a paging algorithm is the number of page faults
it incurs. Under this assumption, $h(k) = 1$ $\forall k \ge 1$. Reduction in cpu
overhead is effected in two ways. First, the number of page faults
is decreased and, as we have seen, a few milliseconds of cpu time are spent
in servicing each page fault. Secondly, after a page fault, if a ready
process is not available, the cpu may have to be idle. By reducing the
number of page faults we reduce the possibility of such idleness. The
effect of prepaging on the ST factor will be discussed in section 3.1.

Granted that prepaging has its benefits, there are two problems
yet to be resolved: who should specify the paging needs in advance, and
when to carry out the page fetches. Hoare [HAOR72] notes that the memory
management routines do not have sufficient knowledge of the future reference
strings; therefore, these forecasts must be made by the programs, either
directly by the programmer or by the compiler. We agree with this statement.
To show the viability of prepaging, we will consider a particular application
and show that prepaging can be done and that it does result in improved
performance. If we resort to demand prepaging, then the problem of when
to fetch pages is solved. Whenever a prepage request is made, we just
flag the page, and on the occurrence of a page fault, we bring in all the
flagged pages.

Note that, under demand paging, the distinction between a page
fault and a page pull vanishes since either implies the other. Under an
arbitrary paging algorithm neither need imply the other. Under demand
prepaging, a page fault implies a page pull but not vice versa. This
means that for demand prepaging, the number of page faults $\le$ the number
of page pulls. The ideal for the number of page faults is zero. It is
possible, at least in theory, to achieve this ideal with arbitrary
prepaging but not with demand prepaging or with demand paging.

3.1  The Optimal Demand Prepaging Algorithm

We define a demand prepaging algorithm DPMIN using the notation
of section 2.3.

DPMIN:  The transition function $g(S, q, x) = (S', q')$, where

\[ S' = \begin{cases}
S & \text{if } x \in S \\
\{y_1, y_2, \ldots, y_\ell\} & \text{if } x \notin S
\end{cases} \]

where $\ell = \min(c, |{}_t\omega|)$ and ${}_t\omega = r_t, \ldots, r_\gamma$, and where
$\forall i$, $y_i \in N$ and $\forall x \in N - S'$, $d_t(x) > d_t(y_i)$.
In other words, at the time of a page fault, DPMIN scans the future
reference string and fetches the first c pages that will be referenced
in future.

Note that DPMIN is unrealizable in the same sense as MIN is
unrealizable. We will see that DPMIN serves as a benchmark of performance.
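
The behavior of DPMIN can be sketched in a few lines of Python. This is an
illustration based on the verbal description above, not the program used
for the measurements in later chapters: the function name
`dpmin_faults_pulls` is hypothetical, pages already resident are assumed
not to be pulled again, and the sample string is the one from section 2.5.

```python
def dpmin_faults_pulls(refs, c):
    """Return (page_faults, page_pulls) for DPMIN with c page frames.

    On a page fault the new memory state becomes the first c distinct
    pages of the remaining reference string, starting with the faulting
    reference itself. Only pages not already resident count as pulls.
    """
    memory = set()
    faults = pulls = 0
    for t, x in enumerate(refs):
        if x in memory:
            continue
        faults += 1
        wanted = []
        for r in refs[t:]:  # scan the future reference string
            if r not in wanted:
                wanted.append(r)
                if len(wanted) == c:
                    break
        new_memory = set(wanted)
        pulls += len(new_memory - memory)
        memory = new_memory
    return faults, pulls


print(dpmin_faults_pulls("ABCDEBCBDAEAC", 3))
```

For this string and c = 3 the sketch reports fewer page faults than the 11
found for LRU in section 2.5, at the price of fetching several pages per
fault.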

We define a paging algorithm A to be optimal with respect to
page faults if for any arbitrary paging algorithm A', and $\forall c \ge 1$,
$\forall \omega \in N^*$,

\[ \pi(A, c, \omega) \le \pi(A', c, \omega), \]

where $\pi$ denotes the number of page faults.
Theorem  3.1: 

DPMIN  is  optimal  among  all  demand  prepaging  algorithms  (in  number 
of  page  faults). 

Proof:  Let $\pi_k(S, t)$ denote the minimum achievable cost under demand
prepaging of processing references $r_{t+1}, \ldots, r_{t+k}$ given that $S_t = S$. If
we define $\pi_0(S, t) = 0$ and let $r_{t+1} = x$, then we can write

\[ \pi_k(S, t) = \begin{cases}
\pi_{k-1}(S,\ t+1) & \text{if } x \in S \\
1 + \min\limits_{Y \subseteq S,\ X \subseteq N-S,\ x \in X} \pi_{k-1}(S + X - Y,\ t+1) & \text{if } x \notin S.
\end{cases} \]

This relation may be recognized as the principle of optimality
in a dynamic programming problem [AHO71, COFF73]. The proof of the
theorem now reduces to showing that paging according to DPMIN is
characterized by the above principle of optimality.

Let $<_t$ be an ordering defined over N for $\forall t \ge 1$ such that
$x <_t y$ iff $d_t(x) < d_t(y)$ (if $d_t(x) = d_t(y)$, lexicographic ordering is
assumed). Denote $M_t = \{y_1, \ldots, y_{c-1}\}$ such that $\forall x \in N - M_t$, $y_i <_t x$
for $\forall\, 1 \le i \le c-1$. Then the transition function for DPMIN may be written
as:

\[ g_{DPMIN}(S, q, x) = \begin{cases}
(S,\ q') & \text{if } x \in S \\
(M_t + x,\ q') & \text{if } x \notin S.
\end{cases} \]

Then it is sufficient to show that

\[ \pi_k(M_t + x,\ t-1) = \min_{S \subseteq N - x,\ |S| = c-1} \pi_k(S + x,\ t-1). \]

Clearly, $\pi_k(M_t + x,\ t-1) = \pi_{k-1}(M_t + x,\ t)$ and
$\pi_k(S + x,\ t-1) = \pi_{k-1}(S + x,\ t)$;
therefore, it is sufficient to show that

\[ \pi_k(M_t + x,\ t) = \min_{S \subseteq N - x,\ |S| = c-1} \pi_k(S + x,\ t). \]

In fact, we will prove that $\Delta\pi_k = \pi_k(S + x,\ t) - \pi_k(M_t + x,\ t) \le 1$. We will
prove this by induction on k. It is clearly true for k = 0 (by definition
of $\pi_0(S, t)$). Assume it is true for $\forall i < k$. If $S = M_t$ then we are done;
therefore assume $S \ne M_t$ and let i be the smallest index such that $y_i \notin S$;
also let $d_t(y_i) = \ell$. First note that $i \le c-1$. Then,

rtk(S+x,  t)  =  1  +  min.(Trk_^(S+X-Y,  t+i+l)) 


and 


«k(Mt+x,  t)  =  *k_^(Mt+x,  t+i+l). 


By  inductive  assumption, 


min.  (ji^CS+X-Y,  t+i+l)) 


Also,  by  induction, 


^k-i  =  Vi(s'+rt+i+i'  t+i+1> 


^-^Mf.+  ^+r++^V  t+'+1)  <!■ 


Then, 


A\  =  irk(S+x,  t)  -  \(Mt+x,  t) 


1  +  \-i(Mt+i+rt+i+i>  t+i+1> 


-  ^^(M^x,    t+i+l) 
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note  that  r      „   ,    €  M  +x. 

.'.   taking  S'    =  Mt+x"rt+^+1* 
^k  =  2  +  Vi(Wrt+i+l'   t+2+1) 

•     -  v/s'+rt+^  t+^+1) 

=  0  or  1  by  the  inductive  hypothesis  on  Art,  g. 
Therefore  for  V  k  >  0, 

This completes the proof of the optimality of DPMIN.

Lemma 3.1:

Given two fixed memory paging algorithms $A$ and $A'$,

\[
\pi(A, c, \omega) \le \pi(A', c, \omega) \;\Rightarrow\; ST(A, c, \omega) \le ST(A', c, \omega)
\]

on the average, assuming the behavior of other concurrent programs remains
the same.

Proof: Since the memory allotted to the program is fixed, $ST = c \cdot t_m$, where
$t_m$ is the total time that the program occupies the MM, and

\[
t_m = t_{cpu} + t_{page\text{-}wait} + t_{ready\ queue}.
\]

Since the cpu time of a program is unaltered by
a change in the paging algorithm, only the other two factors need to be
considered. Now assume $T'$ is the average time that the program has to
spend in page wait and then in the ready queue. Clearly, $T'$ depends on
the characteristics of other concurrently executing programs and on the
average page fetch time $T$. Therefore, $T'$ can be assumed to be a constant, and

\[
t_m = t_{cpu} + (\#\,\text{page faults}) \cdot T'.
\]

The required result immediately follows.
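
The relation used in the proof can be made concrete with a small numerical sketch. The function name `space_time` and all numbers below are hypothetical; the only thing taken from the text is the form $ST = c \cdot (t_{cpu} + \#\text{page faults} \cdot T')$ with $T'$ held constant.

```python
def space_time(c, t_cpu, page_faults, t_wait):
    """ST product for a fixed allotment of c page frames.

    t_wait plays the role of T': the average time spent in page wait and
    then in the ready queue per page fault (assumed constant here).
    """
    t_m = t_cpu + page_faults * t_wait  # total time the program occupies MM
    return c * t_m


# Hypothetical figures: same program and allotment, two paging algorithms
# that differ only in the number of page faults they produce.
st_a = space_time(c=8, t_cpu=1000, page_faults=40, t_wait=25)
st_b = space_time(c=8, t_cpu=1000, page_faults=65, t_wait=25)
print(st_a, st_b)
```

With everything else held fixed, the algorithm with fewer page faults cannot have the larger ST product, which is all the lemma claims.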
Theorem 3.2:

DPMIN minimizes the ST product among all demand prepaging algorithms
under the assumptions of Lemma 3.1.

Proof: The proof of this theorem follows directly from Theorem 3.1 and
Lemma 3.1.

We would have liked to prove optimality of DPMIN with respect to
page pulls, but unfortunately, it does not hold. We will be able to see
this in Chapter 5. As a result of Lemma 3.1, we need only discuss
page fault and page pull measures for all fixed memory type paging
algorithms.

Note that the class of all demand paging algorithms is included
in the class of all demand prepaging algorithms. Therefore, applying
Theorem 3.1 to the optimal demand paging algorithm MIN, we get

\[
\pi(\mathrm{DPMIN}, c, \omega) \le \pi(\mathrm{MIN}, c, \omega).
\]

We will show that there are cases when strict inequality holds. On the
other hand, it is trivial to find examples for which equality holds.
Note that MIN is optimal in page pulls among all paging algorithms;
therefore,

\[
C(\mathrm{DPMIN}, c, \omega) \ge C(\mathrm{MIN}, c, \omega) = \pi(\mathrm{MIN}, c, \omega).
\]

Once again, there are cases when equality holds in the above relation
and there are cases when strict inequality holds. Thus DPMIN is superior
in two performance measures (page faults and ST product) and MIN is
superior in page pulls. Quantitative results of the comparison will be
given in Chapter 5 for specific applications.
3.1.1 Performance Measurement for DPMIN

The problem is to find the number of page faults and the number
of page pulls for a given reference string $\omega$ and a given page allotment $c$,
using the DPMIN paging algorithm. We present a numeral matrix algorithm
for this purpose, based on the numeral matrix algorithm for MIN [BELA74].
The numeral matrix has as many rows as the number of pages in the VA space
and as many columns as the number of page faults. The algorithm is as
follows:

Initially the matrix is blank.

1. Suppose the next reference in $\omega$ is to a page x and that
   the rightmost nonempty column is (i-1).

2. Let j be the rightmost column with c markings
   (note: j = i-1 or i-2).

3. If j = i-1
      then if the entry (x, i-1) is blank
              then mark (x, i) and return
              else return
      else
4. If j = i-2 then
      if (x, j) is blank then mark (x, i-1)
      and return else mark (x, i-1) and return.

In step 3, marking (x, i) implies a page fault and a page pull. In step 4,
if (x, j) is blank then a page pull occurs (no page fault).

After the complete reference string is processed, the number
of page faults is obtained by counting the number of nonempty columns of
the numeral matrix. The number of page pulls in excess of page faults is
obtained by counting the number of times the clause 'if (x, j) is blank'
is satisfied in step 4.
Clearly, this algorithm is c-dependent. We illustrate its use
by the following example.

N = {A, B, C, D, E}, c = 3,

$\omega$ = ABCDEDBCBMEAC. The number 1 is used for marking.
If a page pull occurred without a page fault then we
indicate this with an asterisk. The numeral matrix
for our example is:

         column:   1     2     3     4
    A              1                 1
    B              1*    1     1
    C              1*          1     1
    D                    1     1
    E                    1*          1*

Clearly, the number of page faults is equal to four. The number of page
pulls is given by:

# page pulls = # of 1* + # page faults
             = 4 + 4 = 8.

Belady et al [BELA74] show that for the same reference string, the MIN
algorithm produces

# page faults = 8 = # of page pulls.

The numeral matrix algorithm is clearly c-dependent, i.e., for
each new value of c, a rescan of the reference string is needed. In
particular, DPMIN is not a stack algorithm. We may verify this fact by
the following example: $\omega$ = ABCD. For c = 2, the memory state sequence
is: {}, {A, B}, {A, B}, {C, D}, {C, D}. For c = 3, the memory state
sequence is: {}, {A, B, C}, {A, B, C}, {A, B, C}, {D}. From these,
$S_t(2) \not\subseteq S_t(3)$, which gives the required result.
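
The counterexample can be checked with a short simulation. The sketch below is not the numeral matrix algorithm; it is a direct simulator written under the assumption, read off the memory state sequences just listed, that on a page fault DPMIN loads the faulting page together with the c-1 other pages of smallest finite forward distance, and that pages never referenced again are not loaded. The function name `dpmin` and the tie-breaking by page name are choices made for this illustration.

```python
def dpmin(refs, c):
    """Simulate DPMIN on the string refs with c frames.

    Returns (page faults, page pulls, list of memory states after each reference).
    """
    memory = set()
    faults = pulls = 0
    states = []
    for t, x in enumerate(refs):
        if x not in memory:
            faults += 1
            # Forward distance of every other page, measured from position t.
            dist = {}
            for d, y in enumerate(refs[t + 1:], start=1):
                if y != x and y not in dist:
                    dist[y] = d
            # Keep the c-1 pages that will be referenced soonest.
            nearest = sorted(dist, key=lambda y: (dist[y], y))[:c - 1]
            new_memory = {x, *nearest}
            pulls += len(new_memory - memory)  # only pages not already resident
            memory = new_memory
        states.append(set(memory))
    return faults, pulls, states


faults2, pulls2, states2 = dpmin("ABCD", 2)
faults3, pulls3, states3 = dpmin("ABCD", 3)
print(states2[-1], states3[-1])
```

Under these assumptions the final state for c = 2 is {C, D} while the final state for c = 3 is {D}, so the inclusion property fails, in agreement with the example.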


Another problem in the implementation of the numeral matrix
algorithm is the storage of the matrix. A close inspection, however,
reveals that only two columns of the numeral matrix need be stored at
any one time. In step 2, we have noted that j = i-1 or i-2. This is
because we create a new column only when the previous column is full.
Thus only the present column and the last column need be kept in storage.
A PL/I implementation of DPMIN is given in Appendix A.

3.2  Realizable  Prepaging  Algorithms 

We have noted that DPMIN is an unrealizable paging algorithm. We
would like to consider some paging algorithms which can be realized.

3.2.1  Freeing  Dead  Pages 

Assume that at time $t$, a certain page x is 'dead', i.e., it does
not occur in the reference string after time $t$. This is equivalent to
saying $d_t(x) = \infty$. Assume that either the programmer or the compiler
has discovered this fact. Furthermore, assume that this information is
provided to the operating system so that either it can push the page x onto the
secondary memory or it can flag the page dead so that, at the time of the
next replacement decision, page x will have the highest priority to be
replaced. Assume that, originally, we started out with a demand paging
algorithm A, and then modified it by the above mechanism and called it FREEA.
The process of declaring a page dead will be called freeing. There are
two possible ways to interpret FREEA. One way is to assume that on the
issuance of a FREE(x) instruction, page x is replaced. In this case
FREEA is a demand-fetch, anticipatory-replace algorithm. The second way
is to assume that on the issuance of a FREE(x) instruction, page x is
flagged (i.e., a 'dead' bit associated with the page table entry of page x
is set), and when a replacement decision is to be made, first a search
is made for a page with the dead bit on. If none is found then the
replacement procedure as in algorithm A is followed. With this
interpretation, FREEA is a demand paging algorithm. Quite clearly, the choice
between these two interpretations does not affect the number of page pulls or
the number of page faults and therefore, we do not have to distinguish between them.
It is clear that $\pi(\mathrm{FREEA}) \le \pi(A)$. To verify this, assume that A has made
a wrong replacement decision at time $t$, and further assume that working
with FREEA, at time $t$, a dead page was present in memory and therefore
FREEA will not make a wrong replacement decision. On the other hand, if
FREEA has made a wrong replacement decision then quite clearly, A must
have also.
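
The second interpretation (a dead bit consulted at replacement time) can be sketched for A = LRU as follows. This is an illustrative sketch, not the PL/I program of Appendix A: the function name `free_lru`, the encoding of FREE(x) as the tuple `("FREE", x)`, and the choice of the lexicographically smallest dead page as the victim are assumptions made here.

```python
from collections import OrderedDict


def free_lru(refs, c):
    """Count page faults of LRU with freeing, using c page frames.

    refs mixes page names with ("FREE", page) instructions.  With no FREE
    instructions present this is plain LRU.
    """
    memory = OrderedDict()  # page -> dead bit, ordered from least to most recent
    faults = 0
    for r in refs:
        if isinstance(r, tuple):          # FREE(x): only set the dead bit
            _, page = r
            if page in memory:
                memory[page] = True
            continue
        if r in memory:                   # hit: refresh recency
            memory.move_to_end(r)
            continue
        faults += 1
        if len(memory) >= c:
            dead = sorted(p for p, flag in memory.items() if flag)
            victim = dead[0] if dead else next(iter(memory))  # dead page first, else LRU
            del memory[victim]
        memory[r] = False
    return faults


with_free = ["A", "B", ("FREE", "B"), "C", "A"]
without_free = [r for r in with_free if not isinstance(r, tuple)]
print(free_lru(without_free, 2), free_lru(with_free, 2))
```

In this small case, LRU alone evicts A to make room for C and then faults again on A, while the freed page B is chosen instead once its dead bit is set.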

It is also clear that $\pi(\mathrm{MIN}) \le \pi(\mathrm{FREEA})$, since FREEA is a demand
paging algorithm and the optimality of MIN is applicable. It should be
noted that $\pi(\mathrm{MIN}) = \pi(\mathrm{FREEMIN})$, since MIN never makes a wrong replacement
decision. If we started out with a demand prepaging algorithm A then we
can show that

\[
\pi(\mathrm{DPMIN}) \le \pi(\mathrm{FREEA}) \le \pi(A)
\]

and that $\pi(\mathrm{DPMIN}) = \pi(\mathrm{FREEDPMIN})$.

If we start out with a general paging algorithm A, then $0 \le \pi(\mathrm{FREEA}) \le \pi(A)$.
Freeing, however, tends to reduce the ST product: first by reducing the
number of page faults and second by reducing the average page-wait time.
The second reason holds only when anticipatory replacement is used.

We have pointed out in Chapter 2 that it is desirable for a
given paging algorithm to be a stack algorithm. Theorem 3.3 answers this
question for FREEA.

Theorem 3.3: (Assume A is a demand paging algorithm)

If A is a stack algorithm then FREEA is a stack algorithm.

Proof: The proof makes use of Proposition 6.1 from the book by
Coffman et al [COFF73], which states that:

A demand paging algorithm B is a stack algorithm
iff $R(S+y, q, x) = R(S, q, x)$ or $y$ whenever $x \notin S + y$.

Let $\Gamma_t$ denote the set of distinct pages in the string $r_1, r_2, \ldots, r_t$.
FREEA partitions the pages of $\Gamma_t$ into two classes. One class is the set
of dead pages. We assume that the set of dead pages is lexicographically
ordered, and the set of nondead pages is ordered by the same ordering as
in algorithm A. Let the set of dead pages in $S$ be denoted by $\Delta(S)$. We
want to prove that the above proposition is satisfied by FREEA. Therefore,
consider the following two cases:

(i) [$y \in \Delta(S+y)$]:

    If $y = \min(\Delta(S+y))$ then $R(S+y, q, x) = y$
    and we are done; otherwise,
    if $y \ne \min(\Delta(S+y))$ then let
    $y' = \min(\Delta(S+y))$. This means
    $y' = \min(\Delta(S))$, which implies
    $R(S+y, q, x) = R(S, q, x) = y'$
    and we are done.

(ii) [$y \notin \Delta(S+y)$]:

    If $\Delta(S) = \emptyset$ then $\Delta(S+y) = \emptyset$ (since $y \notin \Delta(S+y)$);
    therefore, FREEA behaves exactly like A,
    which means $R(S+y, q, x) = R(S, q, x)$
    (since A is a stack algorithm).

    If $\Delta(S) \ne \emptyset$ then let $y' = \min(\Delta(S))$.
    Since $y \notin \Delta(S+y) \Rightarrow \Delta(S) = \Delta(S+y)$
    $\Rightarrow y' = \min(\Delta(S+y)) \Rightarrow$
    $R(S+y, q, x) = R(S, q, x) = y'$.

Thus FREEA is a stack algorithm.
In Appendix A, we give a PL/I program for FREELRU.

3.2.2   Prepaging 

We now consider some practical prepaging algorithms. Our
objective is to consider prepaging algorithms which reduce page faults
without increasing page pulls significantly and thereby obtain an
improvement over demand paging algorithms. When working with fixed
memory algorithms, at the beginning of a time slice, a few page frames
remain unoccupied in a demand paging situation. If we could prepage
some useful pages to fill up this unused space, we would probably reduce
page faults. The second important rule we incorporate into the algorithm
is that we never replace a useful (non-dead) page in order to bring in a
prepaged page. This reduces the possibility of increasing the page
faulting in a prepaging algorithm. After the MM is full, this scheme
would not allow us to do any prepaging unless we have some dead pages
which can be freed. We will see that this situation is the most valuable.

The first prepaging algorithm that we propose is PRE1A. Assume
that the reference string is modified to have PRE(x) inserted at various
places.

PRE1A: (Assume A is a demand paging algorithm)

    If $|S_t| < c$ then if $r_{t+1} = \mathrm{PRE}(x)$
    and $x \notin S_t$ then $S_{t+1} = S_t + x$.

    Otherwise PRE1A works just like A.
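
A minimal sketch of PRE1A with A = LRU is given below. The function name `pre1_lru`, the encoding of PRE(x) as `("PRE", x)`, and the decision to treat a freshly prepaged page as the most recently used one are assumptions of this illustration, not part of the definition above.

```python
from collections import OrderedDict


def pre1_lru(refs, c):
    """Return (page faults, page pulls) for PRE1A built on LRU with c frames."""
    memory = OrderedDict()  # keys ordered from least to most recently used
    faults = pulls = 0
    for r in refs:
        if isinstance(r, tuple):              # ("PRE", x)
            _, x = r
            if len(memory) < c and x not in memory:
                memory[x] = None              # prepage into a free frame only
                pulls += 1
            continue
        if r in memory:                       # hit: refresh recency
            memory.move_to_end(r)
            continue
        faults += 1                           # otherwise behave exactly like LRU
        pulls += 1
        if len(memory) >= c:
            memory.popitem(last=False)        # evict the least recently used page
        memory[r] = None
    return faults, pulls


with_pre = ["A", ("PRE", "B"), ("PRE", "C"), "B", "C", "D", ("PRE", "E"), "A"]
plain = [r for r in with_pre if not isinstance(r, tuple)]
print(pre1_lru(plain, 3), pre1_lru(with_pre, 3))
```

In this example the two early PRE instructions remove two faults at no extra cost in pulls, while PRE(E) is ignored because no frame is free, which illustrates why the condition $|S_t| < c$ limits how often PRE1A can prepage.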


This algorithm has two drawbacks. First, it implies a page-wait for each
prepage operation, which is what we were trying to avoid in the first
place, and second, it is not able to prepage frequently since the
situation $|S_t| < c$ arises only initially. We can solve the first
problem in two ways. One way is to allow the execution of the program
and the input operation of the prepaged page simultaneously, which will
be called the "overlap solution." A second way is to concatenate many
page fetches together and resort to demand prepaging. The solution to the
'$|S_t| < c$' problem is to combine freeing and prepaging together.

If we use the overlap solution then after the occurrence of PRE(x)
in the reference string, a page frame is reserved for x but x cannot be
assumed to reside in $S_{t+1}$. Page x is then said to be in a 'not-set-up'
state. After real time $T$ (the average page-wait time) has elapsed, page x
is said to be in the 'set-up' state and can be considered resident in MM. Now,
if before a page x is set up, a page fault for another page y (or the
same page x) occurs, then during this page-wait period, page x will
become set-up. Thus, in general, on the occurrence of a page fault, all
not-set-up pages in MM will become set-up. We divide the memory state
$S_t(c)$ into two disjoint sets $U_t(c)$ and $N_t(c)$. $N_t(c)$ consists of the set
of all not-set-up pages and $U_t(c) = S_t(c) - N_t(c)$. We will assume
$c \ge 2$ for all prepaging algorithms. We now define PRE2A. We associate
a set-up counter 'COUNT' with each not-set-up page.
PRE2A:
[1] (Page fault for a not-set-up page)

    If $r_{t+1} = x \in N_t(c)$ then (declare all not-set-up pages set-up)

        $U_{t+1}(c) = U_t(c) \cup N_t(c) = S_t(c) = S_{t+1}(c)$;

        $N_{t+1}(c) = \emptyset$.

[2] (Page reference to a set-up page)

    If $r_{t+1} = x \in U_t(c)$ then

        (increment COUNT for each not-set-up page)

        $\forall y \in N_t(c)$, COUNT(y) $\leftarrow$ COUNT(y) + 1.

        (declare all not-set-up pages with COUNT $\ge T$ set-up)

        $U_{t+1}(c) = U_t(c) \cup \{y \in N_t(c) \mid \mathrm{COUNT}(y) \ge T\}$.

        $S_{t+1}(c) = S_t(c)$.

        $N_{t+1}(c) = S_{t+1}(c) - U_{t+1}(c)$.

[3] (Reference to a page not in $S_t(c)$ $\Rightarrow$ page fault)

    If $r_{t+1} = x \notin S_t(c)$ then

        (bring in the required page, replace a page if
        necessary and declare all not-set-up pages set-up)

        $U_{t+1}(c) = S_t(c) + x - R_A(U_t, q, x)$;

        $N_{t+1}(c) = \emptyset$; $S_{t+1}(c) = U_{t+1}(c)$.

    Note here that $R_A(U_t, q, x)$ denotes the page that would
    have been replaced by paging algorithm A with the given
    values of the parameters.

[4] (A prepage instruction)

    If $r_{t+1} = \mathrm{PRE}(x)$ and $x \notin S_t(c)$ then
        if $|S_t| < c$ then

            $N_{t+1}(c) = N_t(c) + x$,

            COUNT(x) $\leftarrow$ 0,

            $U_{t+1}(c) = U_t(c)$.

End PRE2A.

We can easily add freeing to PRE2A and obtain FREEPRE2A. It can
easily be proved that both PRE2A and FREEPRE2A are not stack algorithms.

It appeared at first that if we restrict our attention to a
certain class of reference strings then PRE2A could be proved to be a
stack algorithm. We will say that a reference string $\omega$ satisfies the
property P if $r_t = \mathrm{PRE}(x) \Rightarrow (\nexists\, t' < t$ such that $r_{t'} = x$ or $r_{t'} = \mathrm{PRE}(x))$.

Initially it was believed that PRE2A is a stack algorithm for
$\{\omega \in N^+ \mid P(\omega)\}$; however, this was found to be false. The reason for
this can be explained as follows. Suppose we are in step (3) of PRE2A.
For a certain value of $c$, say $c_1$, assume that $r_{t+1} = x \notin S_t(c_1)$ and let
$c_1$ be the largest such value of $c$, $\Rightarrow x \in S_t(c_1+1)$ (assume that x is
set-up in a memory of size $c_1+1$). A page fault occurs for $c = c_1$, whereas
no page fault occurs for $c = c_1+1$ $\Rightarrow U_{t+1}(c_1) = S_t(c_1) + x - y$, $y \in U_t(c_1)$.
But $U_{t+1}(c_1+1) = U_t(c_1+1) \cup \{y \in N_t(c_1+1),\ \mathrm{COUNT}(y) \ge T-1\}$. Now, the
inclusion property for $U$ cannot be proved in general. A similar
situation occurs in step (1). We, therefore, modify PRE2A to obtain
PRE3A, which is indeed a stack algorithm.

PRE3A:
[1] If $r_{t+1} = x \in N$ then

    [a] if $x \in N_t(c)$ then (considered to be a page fault)

            $\forall y \in N_t(c)$, COUNT(y) $\leftarrow$ COUNT(y) + 1;

            $U_{t+1}(c) = U_t(c) + x \cup \{y \mid y \in N_t(c),\ \mathrm{COUNT}(y) \ge T\}$;

            $S_{t+1}(c) = S_t(c)$;

            $N_{t+1}(c) = S_{t+1}(c) - U_{t+1}(c)$;

            and RETURN;


    [b] if $x \in U_t(c)$ then (reference to a set-up page)

            $\forall y \in N_t(c)$, COUNT(y) $\leftarrow$ COUNT(y) + 1;

            $U_{t+1}(c) = U_t(c) \cup \{y \mid y \in N_t(c),\ \mathrm{COUNT}(y) \ge T\}$;

            $S_{t+1}(c) = S_t(c)$;

            $N_{t+1}(c) = S_{t+1}(c) - U_{t+1}(c)$;

            and RETURN;

    [c] if $x \notin \Gamma_t$ (equivalent to saying $x \notin S_t(c')$ for any $c'$,
        which is equivalent to saying $\nexists\, t' \le t$ such that $r_{t'} = x$ or
        $r_{t'} = \mathrm{PRE}(x)$, which is equivalent to saying $x \notin s_t(k)$
        for any $k$) then

            if $|S_t(c)| < c$ then

                $U_{t+1}(c) = S_t(c) + x$; $N_{t+1}(c) = \emptyset$;

                $S_{t+1}(c) = U_{t+1}(c)$;

            else $U_{t+1}(c) = S_t(c) + x - R_A(U_t, q, x)$;

                $N_{t+1}(c) = \emptyset$; $S_{t+1}(c) = U_{t+1}(c)$;

            and RETURN;

    [d] if $x \notin S_t(c)$ [but $x \in \Gamma_t$] then

            $\forall y \in N_t(c)$, COUNT(y) $\leftarrow$ COUNT(y) + 1;

            if $|S_t(c)| < c$ then

                $U_{t+1}(c) = U_t(c) + x \cup \{y \mid y \in N_t(c),\ \mathrm{COUNT}(y) \ge T\}$;

                $S_{t+1}(c) = S_t(c) + x$; $N_{t+1}(c) = S_{t+1}(c) - U_{t+1}(c)$;

                and RETURN;

            else ($|S_t(c)| = c$)

                $U_{t+1}(c) = U_t(c) + x - R_A(U_t, q, x) \cup \{y \mid y \in N_t(c),\ \mathrm{COUNT}(y) \ge T\}$;

                $S_{t+1}(c) = S_t(c) + x - R_A(U_t, q, x)$;

                $N_{t+1}(c) = S_{t+1}(c) - U_{t+1}(c)$;

                and RETURN;

    else RETURN;

[2] If $r_{t+1} = \mathrm{PRE}(x)$, $x \in N$ and $x \notin S_t(c)$
    then

        if $|S_t(c)| < c$ then $N_{t+1}(c) = N_t(c) + x$;

            $U_{t+1}(c) = U_t(c)$; COUNT(x) $\leftarrow$ 0;

            $S_{t+1}(c) = S_t(c) + x$;

        and RETURN;
END PRE3A;

We will say that an absolute page fault has occurred at time $t+1$
if $r_{t+1} = x$ and $x \notin \Gamma_t$ (or any of the equivalent conditions). What this
amounts to is that $r_{t+1} = x$ is the first occurrence of x in $\omega$. PRE3A is
different from PRE2A in that only on the occurrence of an absolute page
fault are all not-set-up pages declared set-up, whereas in the latter
this is done on every page fault.

Theorem 3.4:

PRE3A is a stack algorithm for $\{\omega \mid \omega \in N^+, P(\omega)\}$.

Proof: See Appendix B for the proof.

Theorem 3.4 allows us to construct an efficient one pass algorithm
to obtain the paging performance of a given reference string $\omega$. Because
of the nature of PRE3A, we would expect PRE2A to give better results than
PRE3A. Intuitively, it seems that $\pi(\mathrm{PRE2A}, c, \omega) \le \pi(\mathrm{PRE3A}, c, \omega)$,
though we have not been able to prove this relation. However, the results
of applying PRE3A to a certain reference string do tell us, at least
approximately, the performance of PRE2A. We note that for each of the
algorithms PREiA we can create another algorithm, say FREEPREiA, which
incorporates freeing and thereby improves the paging performance. Clearly,
FREEPRE3A is a stack algorithm, which follows from Theorem 3.3.

In both PRE2A and PRE3A, whenever a page makes a transition from
the not-set-up state to the set-up state, we immediately consider it the same as
any other page in $U_t$. This means that a prepaged page can be replaced
even before it is used at least once. This kind of replacement may lead
to increased page traffic. When a PRE(x) has been inserted in the
reference string, either by the programmer or by the compiler, we have
reason to believe that page x will be used in the near future, so it
might pay off to keep the page locked in MM until its first use. We now
divide the memory state $S_t(c)$ into three disjoint sets: $U_t(c)$ is the
set of used pages (used at least once), $P_t(c)$ is the set of prepaged,
set-up but not yet used pages and $N_t(c)$ is the set of not-set-up pages.
We define PRE4A corresponding to PRE2A as follows:
PRE4A:
[1] If $r_{t+1} = x \in N$ then

    [a] if $x \in N_t(c)$ then (page fault)

            $U_{t+1}(c) = U_t(c) + x$;
            $P_{t+1}(c) = P_t(c) + N_t(c) - x$;
            $N_{t+1}(c) = \emptyset$; and RETURN;

    [b] if $x \in P_t(c)$ then (success)

            $\forall y \in N_t(c)$, COUNT(y) $\leftarrow$ COUNT(y) + 1;
            $P_{t+1}(c) = P_t(c) - x \cup \{y \mid y \in N_t(c),\ \mathrm{COUNT}(y) \ge T\}$;
            $U_{t+1}(c) = U_t(c) + x$;
            $N_{t+1}(c) = \{y \in N_t(c) \mid \mathrm{COUNT}(y) < T\}$;
            and RETURN;

    [c] if $x \in U_t(c)$ then (success)

            $\forall y \in N_t(c)$, COUNT(y) $\leftarrow$ COUNT(y) + 1;
            $U_{t+1}(c) = U_t(c)$;
            $P_{t+1}(c) = P_t(c) \cup \{y \mid y \in N_t(c),\ \mathrm{COUNT}(y) \ge T\}$;
            $N_{t+1}(c) = \{y \in N_t(c) \mid \mathrm{COUNT}(y) < T\}$;
            and RETURN;

    [d] if $x \notin S_t(c)$ then (page fault)

            $U_{t+1}(c) = U_t(c) + x - R_A(U_t, q, x)$;
            $P_{t+1}(c) = P_t(c) \cup N_t(c)$;
            $N_{t+1}(c) = \emptyset$;
            and RETURN;

[2] If $r_{t+1} = \mathrm{PRE}(x)$ then if $x \notin S_t$ then
        if $|S_t(c)| < c$ then

            $N_{t+1}(c) = N_t(c) + x$; COUNT(x) $\leftarrow$ 0;
            $P_{t+1}(c) = P_t(c)$; $U_{t+1}(c) = U_t(c)$;
END PRE4A;

Quite clearly, PRE4A is not a stack algorithm. We can define
another algorithm PRE5A which does a not-set-up to set-up conversion only
on the occurrence of an absolute page fault. PRE5A will be proved to
be a stack algorithm.

PRE5A:

[1] If $r_{t+1} = x \in N$ then

    [a] if $x \in N_t(c)$ then (a page fault)

            $\forall y \in N_t(c)$, COUNT(y) $\leftarrow$ COUNT(y) + 1;
            $P_{t+1} = P_t \cup \{y \in N_t - x \mid \mathrm{COUNT}(y) \ge T\}$;
            $U_{t+1} = U_t + x$;
            $N_{t+1} = \{y \in N_t - x \mid \mathrm{COUNT}(y) < T\}$;
            and RETURN;

    [b] if $x \in P_t(c)$ then (success)

            $\forall y \in N_t$, COUNT(y) $\leftarrow$ COUNT(y) + 1;
            $P_{t+1} = P_t - x \cup \{y \in N_t \mid \mathrm{COUNT}(y) \ge T\}$;
            $U_{t+1} = U_t + x$;
            $N_{t+1} = \{y \in N_t \mid \mathrm{COUNT}(y) < T\}$;
            and RETURN;

    [c] if $x \in U_t$ then (success)

            $\forall y \in N_t$, COUNT(y) $\leftarrow$ COUNT(y) + 1;
            $U_{t+1} = U_t$;
            $P_{t+1} = P_t \cup \{y \in N_t \mid \mathrm{COUNT}(y) \ge T\}$;
            $N_{t+1} = \{y \in N_t \mid \mathrm{COUNT}(y) < T\}$;
            and RETURN;

    [d] if $x \notin \Gamma_t$ then (an absolute page fault)

            $U_{t+1} = U_t + x - R_A(U_t, q, x)$;
            $P_{t+1} = P_t \cup N_t$;
            $N_{t+1} = \emptyset$;
            and RETURN;

    [e] if $x \notin S_t$ then (a page fault)

            $\forall y \in N_t$, COUNT(y) $\leftarrow$ COUNT(y) + 1;
            $U_{t+1} = U_t + x - R_A(U_t, q, x)$;
            $P_{t+1} = P_t \cup \{y \in N_t \mid \mathrm{COUNT}(y) \ge T\}$;
            $N_{t+1} = \{y \in N_t \mid \mathrm{COUNT}(y) < T\}$;
            and RETURN;

[2] If $r_{t+1} = \mathrm{PRE}(x)$ and $x \notin S_t$ and
    $|S_t(c)| < c$ then (prepage the required page)

            $U_{t+1} = U_t$;
            $N_{t+1} = N_t + x$; COUNT(x) $\leftarrow$ 0;
            $P_{t+1} = P_t$; and RETURN;
END PRE5A;
Theorem 3.5:

PRE5A is a stack algorithm for $\{\omega \in N^+ \mid P(\omega)\}$.

The proof is very similar to the proof of Theorem 3.4, so we
omit it here.

When faced with the choice between the use of PRE4A and PRE5A, we
recommend the use of PRE4A. However, for efficiency in performance
measurement, one could use PRE5A. In the following theorem we prove that the
performance of PRE4A is no worse than the performance of PRE5A, which
explains the comments just made about these two algorithms.

Theorem 3.6:

$\pi(\mathrm{PRE4A}, c, \omega) \le \pi(\mathrm{PRE5A}, c, \omega)$.

Proof: We will first prove that for all $t \ge 1$,

    $S_t(\mathrm{PRE4A}, c, \omega) = S_t(\mathrm{PRE5A}, c, \omega)$,
    $U_t(\mathrm{PRE4A}, c, \omega) = U_t(\mathrm{PRE5A}, c, \omega)$,
    $P_t(\mathrm{PRE4A}, c, \omega) \supseteq P_t(\mathrm{PRE5A}, c, \omega)$.

From these, the required result follows since a page fault
occurs when $r_{t+1} = x \in S_t(c)$ but $x \notin P_t(c)$, $x \notin U_t(c)$.

The proof is obtained by studying the inputs and outputs to the
sets $P_t$, $U_t$, $N_t$. We draw a set input-output diagram in which three
circles are the sets $P_t$, $N_t$, $U_t$. The arcs are the inputs and outputs
to these sets. Labels to the arcs are of the form d/X where d is the
step of the algorithm in which this arc is relevant and X is the set
of pages to which this arc applies.
For PRE4A, the diagram is:

[Figure: set input-output diagram for PRE4A]

For PRE5A, the diagram is:

[Figure: set input-output diagram for PRE5A]

Note that the following relation between the steps of
the two algorithms holds:

    PRE4A.(1d) = PRE5A.(1d) $\cup$ PRE5A.(1e)

    and PRE4A.(2) = PRE5A.(2)

    $\Rightarrow$ PRE4A.$S_t$ = PRE5A.$S_t$, $\forall t \ge 1$, $\forall c \ge 2$.

Input to $U_t$ is strictly a function of $\omega$ and output from $U_t$ occurs
only in steps 1(d) and 1(e); therefore, by the above step relations,

    PRE4A.$U_t$ = PRE5A.$U_t$, $\forall t \ge 1$, $\forall c \ge 2$.

Then clearly, by the arrows going from N to P we conclude that

    PRE4A.$P_t \supseteq$ PRE5A.$P_t$, $\forall t \ge 1$, $\forall c \ge 2$.

This gives the required result. □

Note that freeing can be added to any PREdA, d = 1, 2, 3, 4, 5,
and we obtain the corresponding FREEPREdA with a better performance,
i.e., $\pi(\mathrm{PREdA}) \ge \pi(\mathrm{FREEPREdA})$. We can also prove that FREEPRE3A and
FREEPRE5A are stack algorithms for $\{\omega \in N^+ \mid P(\omega)\}$. We can also show
that $\pi(\mathrm{FREEPRE4A}, c, \omega) \le \pi(\mathrm{FREEPRE5A}, c, \omega)$, which means that, though
FREEPRE4A is recommended for use in practice, its performance can be bounded
by that of FREEPRE5A in lesser time.

Note that each of these prepaging algorithms can be reduced to
a demand prepaging algorithm by removing the mechanism of time-dependent
set-up of a not-set-up page and allowing set-up only on the occurrence of
a page fault. To implement this in practice, we do the following: on
the occurrence of a PRE(x) in $\omega$, a prepage bit is turned on in the page
table entry associated with x. At the time of the next page fault, say,
for a page y, we issue page pull commands for y and all the pages with
the prepage bit turned on (subject to the memory size constraint). We will
label (FREE)PREdA, after the above transformation, as (FREE)DPREdA. We
note that

\[
\pi(\mathrm{DPMIN}, c, \omega) \le \pi(\mathrm{FREEDPREdA}, c, \omega).
\]
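
The prepage-bit mechanism just described can be sketched as follows for an LRU base algorithm. This is only an illustration under several assumptions: the function name `dpre_lru` is hypothetical, PRE(x) is encoded as `("PRE", x)`, flagged pages are pulled only into free frames (the memory size constraint), they are pulled in the order in which they were flagged, and the faulting page is treated as more recently used than the pages prepaged with it.

```python
from collections import OrderedDict


def dpre_lru(refs, c):
    """Return (page faults, page pulls) for a demand prepaging variant of LRU."""
    memory = OrderedDict()  # keys ordered from least to most recently used
    pending = []            # pages whose prepage bit is currently on
    faults = pulls = 0
    for r in refs:
        if isinstance(r, tuple):          # PRE(x): only turn the prepage bit on
            _, x = r
            if x not in memory and x not in pending:
                pending.append(x)
            continue
        if r in memory:                   # hit: refresh recency
            memory.move_to_end(r)
            continue
        faults += 1
        if r in pending:
            pending.remove(r)
        if len(memory) >= c:
            memory.popitem(last=False)    # replace the least recently used page
        memory[r] = None
        pulls += 1
        # Pull flagged pages along with the faulting page, free frames only.
        while pending and len(memory) < c:
            memory[pending.pop(0)] = None
            pulls += 1
        memory.move_to_end(r)
    return faults, pulls


with_pre = [("PRE", "B"), ("PRE", "C"), "A", "B", "C", "D"]
plain = [r for r in with_pre if not isinstance(r, tuple)]
print(dpre_lru(plain, 3), dpre_lru(with_pre, 3))
```

Here the single fault on A also brings in B and C, so the same four pages are pulled in both runs but two page-waits are saved.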

We now have to show that these new paging algorithms do achieve
something for us in practice. We will use FREELRU and FREEDPRE4LRU and
study the behavior of these two algorithms as compared to LRU, MIN and
DPMIN. We will carry out our experiments on several matrix problems in
Chapter 5 and report the results in that chapter.
3.2.2.1 Joseph's OBL Algorithm [JOSE70]

So far, we have only discussed deterministic prepaging, i.e., the
prepaging specified by the programmer or compiler. In the deterministic
case, we assumed that the prediction is always correct. Prepaging
schemes developed by Joseph [JOSE70] can be called probabilistic
prepaging schemes. These schemes are implemented in the operating
system and do not need a prescan of the program. They are based on
the observation that a program tends to refer to its pages sequentially.
Joseph has shown experimentally that such prepaging schemes do reduce
page faults. We would like to show analytically that such is the case.
Clearly, for such analytical work, we should have a program behavior
model. We will define a sequential-random model of program behavior
for this purpose. Before that, we will define a demand prepaging
algorithm based on the OBL algorithm of Joseph. We will follow the
approach of King [KING71] to analyze the behavior of this algorithm.

DPOBLLRU = $(V, Q, N, S_0, q_0, g)$, where $V = \{S \mid S \subset N,\ |S| = c\}$ is
the set of memory states, $S_0$ is the initial memory state, $q_0$ is the initial
control state, $Q$ is the set of control states $q = (j_1, j_2, \ldots, j_c)$ where
$j_k \in N$ ($\forall k \in [1, c]$) and $j_i \ne j_l$ for $i \ne l$ (i.e., they are all distinct), and
$g$ is the transition function, $g: V \times Q \times N \to V \times Q$:

\[
g(S, q, x) =
\begin{cases}
(S,\, q') & \text{if } x \in S, \\
(S'',\, q'') & \text{if } x \notin S,\ x \oplus 1 \in S, \\
(S''',\, q''') & \text{if } x \notin S,\ x \oplus 1 \notin S,
\end{cases}
\]

where $x \oplus 1$ denotes $x + 1 \bmod n$,

    $q' = (j_1, \ldots, j_{k-1}, j_{k+1}, \ldots, j_c, x)$ where $j_k = x$;
    $S'' =$ if $x \oplus 1 \ne j_1$ then $S - j_1 + x$;
           else $S - j_2 + x$;
    $q'' =$ if $x \oplus 1 \ne j_1$ then $(j_2, \ldots, j_c, x)$;
           else $(j_1, j_3, \ldots, j_c, x)$;
    $S''' = S - j_1 - j_2 + x + (x \oplus 1)$;
    $q''' = (x \oplus 1, j_3, \ldots, j_c, x)$;
END DPOBLLRU;

From the algorithm, it is clear that given $q = (j_1, \ldots, j_c)$ we
can conclude that $S = \{j_1, \ldots, j_c\}$. Therefore, a configuration $(S, q)$
of the algorithm is completely specified by specifying $q$. Therefore,
the set of configurations is the set $Q$. Our first objective is to seek
the number of states in $Q$.

Definition: A state $q \in Q$ is said to have property $z$ (i.e., $z(q)$) if
$\exists\, x \in N$ such that $x$ and $x \oplus 1$ both occur in $q$.

Lemma 3.2:

$z(q_0) \Rightarrow \forall q \in Q,\ z(q)$.

Proof: The proof is by induction. In the transition function $g$ of
DPOBLLRU, assume that $z(q)$, and we will prove that $z(q')$, $z(q'')$ and $z(q''')$.
Observe that the property $z$ is invariant under a permutation of its
argument. Therefore, $z(q) \Rightarrow z(q')$ since $q'$ is a permutation of $q$.

Since $x \oplus 1 \in S$, and from the definition of $S''$, $x \oplus 1 \in S''$. Also,
from the definition of $S''$, $x \in S''$; therefore $z(q'')$. From the definition
of $q'''$, we have $z(q''')$. The result of the lemma immediately follows. □
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Lemma  3«  3  '• 

In  I    nv>  (n-c)!  n 

lQl  =  Pc  _   (n-2c)!(n-c) 

where  P  denotes  the  c-permutations  out  of  n  objects, 
c 

Proof:  Berge  [BERG71,  p.  31]  has  proved  that  the  number  of  subsets  of 
size  c  (from  the  set  N  of  n  elements),  which  do  not  contain  either  two 
consecutive  integers  or  both  1  and  n  simultaneously,  is  given  by 

***.t        \   /n_c\  P 
f*(n,c)  =  (    )  . 

v  '  '   v  c  y  n-c 

From  Lemma  3.2  it  follows  that  |q|  =  nP  -  c!f*(n,  c). 

Substituting  for  f*(n,  c),  we  get  the  required  result. 
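The count in Lemma 3.3 can be checked by brute force for small n and c. The sketch below is not part of the original analysis: it numbers pages 0 to n-1 (an assumption), enumerates every c-permutation, keeps those with property z, and compares the total with the closed form. The function names are hypothetical.

```python
from itertools import permutations
from math import comb, factorial, perm


def has_z(q, n):
    """True if some page x and its successor (x + 1) mod n both occur in q."""
    pages = set(q)
    return any((x + 1) % n in pages for x in pages)


def count_states_brute(n, c):
    """Count the c-permutations of n pages that have property z."""
    return sum(1 for q in permutations(range(n), c) if has_z(q, n))


def count_states_formula(n, c):
    """|Q| = nPc - c! * f*(n, c), with f*(n, c) = C(n-c, c) * n / (n-c)."""
    f_star = comb(n - c, c) * n // (n - c)   # always an integer (it is a count)
    return perm(n, c) - factorial(c) * f_star
```

For example, with n = 6 and c = 2 both functions give 12: the six circularly adjacent pairs, each in two orders.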

It has been mentioned that the set of configurations ($Q$) of
DPOBLLRU forms a Markov chain. Assume that the transition probability
matrix $P = (p_{q',q})$ for the Markov chain is given.

Lemma 3.4:

The Markov chain of configurations of DPOBLLRU is irreducible
provided $\Pr[r_t = i] \ne 0$, $\forall i \in N$.

Proof: It is sufficient to show that $\forall q \in Q$, $\forall q' \in Q$, $\exists k > 0$ such
that $p^{(k)}_{q',q} > 0$, where $p^{(k)}_{q',q}$ represents the probability of a k-step
transition from state $q'$ to state $q$. Starting with an arbitrary state
$q' \in Q$ we will show a finite sequence of transitions to an arbitrary
state $q \in Q$ and thus we will prove the lemma.

Let $q' = (i_1, \ldots, i_c)$ and $q = (j_1, \ldots, j_c)$, with $q \ne q'$.
By applying Lemma 3.2 to $q$, we get $\exists i \in N$ such that both $i$ and $i \oplus 1$
occur in $q$. Let $j_\ell = i$, $1 \le \ell \le c$, and $j_s = i \oplus 1$, $1 \le s \le c$, $s \ne \ell$.
Without loss of generality, assume that $s > \ell$. We present a sequence
of states starting from $q'$ and leading to $q$ and strictly following the
transition function $g$. Clearly, there is an associated reference
string, $\omega$, such that $g^*(q', \omega) = q$. Note here that $g^*$ is the extension
of $g$ to strings and that we have omitted the memory state $S$, since it is completely
specified by the control state. Now, since we give a transition from
$q'$ to $q$ in a finite number of steps, $\omega$ is a finite string. And since by
our assumption $p(\omega) > 0$, we get the required result. [Note that in the
following, an asterisk means an unspecified or irrelevant page number.]

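$q' \to q_1 = (*, \ldots, *, j_1)$, with $x = j_1$
$q_1 \to q_2 = (*, \ldots, *, j_1, j_2)$, with $x = j_2$
    ...
$q_{\ell-2} \to q_{\ell-1} = (*, \ldots, *, j_1, j_2, \ldots, j_{\ell-1})$, with $x = j_{\ell-1}$
$q_{\ell-1} \to q_{\ell} = (*, \ldots, *, j_1, j_2, \ldots, j_{\ell-1}, j_{\ell+1})$, with $x = j_{\ell+1}$
    ...
$q_{s-3} \to q_{s-2} = (*, \ldots, *, j_1, \ldots, j_{\ell-1}, j_{\ell+1}, \ldots, j_{s-1})$, with $x = j_{s-1}$
$q_{s-2} \to q_{s-1} = (*, \ldots, *, j_1, \ldots, j_{\ell-1}, j_{\ell+1}, \ldots, j_{s-1}, j_{s+1})$, with $x = j_{s+1}$
    ...
$q_{c-3} \to q_{c-2} = (*, *, j_1, \ldots, j_{\ell-1}, j_{\ell+1}, \ldots, j_{s-1}, j_{s+1}, \ldots, j_c)$, with $x = j_c$
$q_{c-2} \to q_{c-1} = (j_s, j_1, \ldots, j_{\ell-1}, j_{\ell+1}, \ldots, j_{s-1}, j_{s+1}, \ldots, j_c, j_\ell)$, with $x = j_\ell$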

Thus with the reference string

\[ \omega_1 = j_1, \ldots, j_{\ell-1}, j_{\ell+1}, \ldots, j_{s-1}, j_{s+1}, \ldots, j_c, j_\ell \]

we have reached the state $q_{c-1}$. Now with the following reference string,

\[ \omega_2 = j_1, \ldots, j_\ell, j_{\ell+1}, \ldots, j_s, j_{s+1}, \ldots, j_c, \]

we will reach the state $q$. Note that all the intermediate states, while
going from $q_{c-1}$ to $q$, are permutations of $q$. Thus $\omega = \omega_1 \omega_2$ is the
required reference string. Clearly, $\omega$ is finite and therefore we have
the required result. □

We  define  the  sequential-random  model  of  program  behavior  as 
follows : 

$\Pr[r_{t+1} = j \mid r_t = i] = p_{ij}$ is given by:

\[
p_{ij} =
\begin{cases}
p_1 & \text{if } j = i, \\
p_2 & \text{if } j = i \oplus 1, \\
p_3 & \text{otherwise.}
\end{cases}
\]

We require that $p_1 > p_2 > p_3 > 0$ and $p_1 + p_2 + (n-2)\, p_3 = 1$. We note
that this model is closer to the behavior of real programs than
the independent reference model.
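A reference string drawn from the sequential-random model can be generated as follows. This is a minimal sketch, not code from the thesis: pages are numbered 0 to n-1, the first reference is drawn uniformly (the model itself does not fix an initial distribution), $p_3$ is derived from the normalisation condition, and the function name and `seed` parameter are hypothetical.

```python
import random


def sequential_random_refs(n, length, p1, p2, seed=0):
    """Generate `length` references with
    Pr[next = cur] = p1, Pr[next = cur (+) 1] = p2, and p3 for each other page."""
    if n < 3:
        raise ValueError("the model needs at least three pages")
    p3 = (1.0 - p1 - p2) / (n - 2)          # from p1 + p2 + (n-2) p3 = 1
    if not (p1 > p2 > p3 > 0.0):
        raise ValueError("the model requires p1 > p2 > p3 > 0")
    if length <= 0:
        return []
    rng = random.Random(seed)
    cur = rng.randrange(n)                  # assumed: uniform first reference
    refs = [cur]
    while len(refs) < length:
        u = rng.random()
        if u < p1:
            nxt = cur                       # repeat the same page
        elif u < p1 + p2:
            nxt = (cur + 1) % n             # sequential successor
        else:
            succ = (cur + 1) % n
            nxt = rng.choice([j for j in range(n) if j != cur and j != succ])
        refs.append(nxt)
        cur = nxt
    return refs
```

Such a string could be fed to any paging simulator to estimate fault rates empirically; no outcome of that comparison is claimed here.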
Conjecture 3.1:

$m(\text{DPOBLLRU}, c) \le m(\text{LRU}, c)$ for the sequential-random model of
program behavior, where $m(A, c)$ denotes the long-term average page
fault probability of algorithm $A$ working with $c$ page frames of MM.

Unfortunately, we have not been able to prove this conjecture.
However, we can prove that $m(\text{DPOBLLRU}, 2) \le m(\text{LRU}, 2)$.

3.3  PWS  Algorithm 

We assume that the program's reference string is fully known in
advance. With this assumption, we will propose a new paging algorithm,
PWS, which will incur zero page faults, on the average. We will also
derive expressions for the page pull probability and the average number
of memory page frames required, and compare the results with the WS algorithm
of Denning [DENN68a].

Assume that $\omega = r_1, \ldots, r_L$ is the given reference string. Define
the 'futuristic' working set $W^*(t, \tau)$ by:

\[ W^*(t, \tau) = \{x \in N \mid \exists t',\ t \le t' < t + \tau,\ r_{t'} = x\}. \]

In other words, $W^*$ is the contents of a window of size $\tau$, looking forward.
The purpose of algorithm PWS is to keep $W^*$ in MM. At any time $t$, if
$r_{t+\tau} \in W^*(t, \tau)$ then nothing needs to be done; otherwise, we issue
instructions to fetch this page and thereby prepage it. We also
determine whether $r_{t-1} \in W^*(t, \tau)$ and if not, we either push it from
MM or declare it dead.
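The forward-looking window is simple to express in code. The sketch below assumes the reference string is held in a Python list `refs` with $r_1$ at index 0, and that t is 1-based as in the text; both function names are hypothetical.

```python
def future_working_set(refs, t, tau):
    """W*(t, tau): the pages referenced at times t, t+1, ..., t+tau-1."""
    return set(refs[t - 1 : t - 1 + tau])


def needs_prepage(refs, t, tau):
    """True if r_{t+tau} exists and is not already in W*(t, tau),
    i.e. PWS would issue a fetch for it at time t."""
    if t + tau > len(refs):
        return False
    return refs[t + tau - 1] not in future_working_set(refs, t, tau)
```

With `refs = [3, 1, 3, 2, 1]` and a window of 2, no fetch is needed at t = 1 because $r_3 = 3$ is already in the window {3, 1}, whereas at t = 2 page 2 must be prepaged.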

If it is assumed that $\tau \ge T$, where $T$ is the page-fetch time,
then since a fetch instruction for a page is issued $\tau$ time units in
advance of its use, we will have no page faults whatsoever. In other words,
we have $m(\text{PWS}, \tau) = 0$ if $\tau \ge T$.

We will now study properties of this model. From the above
definitions the following properties can be proved. (We denote $|W^*(t, \tau)|$
by $w^*(t, \tau)$.)

P1: a) $W^*(t, 0) = \emptyset$, which implies $w^*(t, 0) = 0$.

b) $W^*(L, \tau) = \{r_L\}$, $\forall \tau > 0$.

c) $w^*(t, \tau)$ is a monotonically non-decreasing function of $\tau$.

d) $w^*(t, \tau)$ is concave downwards. To see this, note that

\[ W^*(t, 2\tau) = W^*(t, \tau) \cup W^*(t + \tau, \tau). \]

Therefore,

\[ w^*(t, 2\tau) \le w^*(t, \tau) + w^*(t + \tau, \tau), \]

since the working sets may overlap. Assuming statistical regularity,
$w^*(t + \tau, \tau)$ behaves, on the average, like $w^*(t, \tau)$; therefore,

\[ w^*(t, 2\tau) \le 2\, w^*(t, \tau). \]

Hence we have the concavity property.

e) If $t > L - \tau$ then $W^*(t, \tau) = \{x \in N \mid x = r_{t'},\ t \le t' \le L\}$.

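Let us denote

\[ s_k^*(\tau) = \frac{1}{k} \sum_{t=1}^{k} w^*(t, \tau), \]

and let the average working set size be $s^*(\tau) = s_L^*(\tau)$. The page pulling
rate $m^*(\tau)$ measures the number of pages per unit time returning to the
working set (note that $m^*(\tau)$ will be an upper bound, since a page which
has left $W^*(t, \tau)$ may still be in MM).

Let

\[
\Delta(t, \tau) =
\begin{cases}
1 & \text{if } r_{t+\tau} \notin W^*(t, \tau) \\
0 & \text{otherwise}
\end{cases}
\qquad (t \ge 1).
\]

Note that $\Delta(t, \tau) = 0$ $\forall t > L - \tau$. Let $m_0^* = \Delta(0, \tau) = |\{r_1, \ldots, r_\tau\}|$.

The page pulling rate is given by:

\[ m^*(\tau) = \lim_{k \to L} \frac{1}{k} \left[ \sum_{t=1}^{k-1} \Delta(t, \tau) + m_0^* \right]. \]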


The interreference distribution $F_i(x)$ and the interreference density function
$f_i(x)$, as well as the relative frequency of reference $\lambda_i$, have the same
meaning as in Denning's work, which is outlined in Chapter 2. Let the mean
overall reference interval be $\bar{x} = \sum_{i=1}^{n} \lambda_i \bar{x}_i$, where
$\bar{x}_i = \sum_{x>0} x f_i(x)$. The page $i$ is
called 'recurrent' if $\lambda_i \ne 0$ and $j$ denotes the number of recurrent pages in $N$.
If $\lambda_i \ne 0$ then $\bar{x}_i = 1/\lambda_i$.

P2: $s^*(\tau)$ is nondecreasing in $\tau$ and is bounded above and below:

\[ 1 = s^*(1) \le s^*(\tau) \le s^*(\tau + 1) \le \min\{n, \tau + 1\}. \]

Proof: $W^*(t, \tau) \subseteq W^*(t, \tau + 1) \Rightarrow w^*(t, \tau) \le w^*(t, \tau + 1)$

$\Rightarrow s^*(\tau) \le s^*(\tau + 1)$. Also,

$w^*(t, 1) = 1 \Rightarrow s^*(1) = 1$.

$w^*(t, \tau + 1) \le \min\{n, \tau + 1\} \Rightarrow s^*(\tau + 1) \le \min\{n, \tau + 1\}$.

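P3: $s^*(\tau + 1) - s^*(\tau) = m^*(\tau) - m_0^*/L$.

Note that $m_0^*/L$ is a very small number compared to $m^*(\tau)$ and can be
ignored for most purposes, particularly if $L$ is very large.

Proof:

\[ W^*(t, \tau + 1) - W^*(t, \tau) \subseteq \{r_{t+\tau}\}. \]

Therefore, $w^*(t, \tau + 1) - w^*(t, \tau) = \Delta(t, \tau)$.

Therefore,

\[
\lim_{k \to L} \left[ \frac{1}{k} \sum_{t=1}^{k} w^*(t, \tau + 1) - \frac{1}{k} \sum_{t=1}^{k} w^*(t, \tau) \right]
= \lim_{k \to L} \frac{1}{k} \sum_{t=1}^{k} \Delta(t, \tau).
\]

Therefore,

\[
s^*(\tau + 1) - s^*(\tau)
= \lim_{k \to L} \frac{1}{k} \left[ \sum_{t=1}^{k-1} \Delta(t, \tau) + m_0^* \right] + \frac{\Delta(L, \tau)}{L} - \frac{m_0^*}{L}
= m^*(\tau) - \frac{m_0^*}{L}. \qquad \Box
\]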
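The key step of this proof, $w^*(t, \tau+1) - w^*(t, \tau) = \Delta(t, \tau)$, and the resulting identity for the averages can be checked numerically on any finite reference string. The sketch below is only an illustration; it uses exact rational arithmetic, treats $\Delta(t, \tau)$ as 0 whenever $t + \tau > L$, and all function names are hypothetical.

```python
from fractions import Fraction


def w_star(refs, t, tau):
    """w*(t, tau): size of the forward window starting at time t (1-based)."""
    return len(set(refs[t - 1 : t - 1 + tau]))


def delta(refs, t, tau):
    """Delta(t, tau): 1 if r_{t+tau} exists and lies outside W*(t, tau), else 0."""
    if t + tau > len(refs):
        return 0
    return 0 if refs[t + tau - 1] in set(refs[t - 1 : t - 1 + tau]) else 1


def s_star(refs, tau):
    """Average of w*(t, tau) over t = 1 .. L."""
    L = len(refs)
    return Fraction(sum(w_star(refs, t, tau) for t in range(1, L + 1)), L)


def check_increment(refs, tau):
    """Check s*(tau+1) - s*(tau) == (1/L) * sum_t Delta(t, tau)."""
    L = len(refs)
    total = sum(delta(refs, t, tau) for t in range(1, L + 1))
    return s_star(refs, tau + 1) - s_star(refs, tau) == Fraction(total, L)
```

The check deliberately leaves out the $m_0^*$ bookkeeping term and verifies only the telescoping part of the argument.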


P4: $m_0^*/L \le m^*(\tau + 1) \le m^*(\tau) \le m^*(0) = 1$,

which states that $m^*(\tau)$ is a nonincreasing function of $\tau$ and is bounded
above and below. This follows immediately from P2 and P3.

P5: $m^*(\tau) = 1 - F(\tau) + m_0^*/L$,

which states that $m^*(\tau)$ can be regarded as the probability that $x > \tau$.

Proof: Define the binary variable $\beta_i(t, x) = 1$ iff $r_{t+\tau} = i$ and $\exists u$, $t \le u < t + \tau$,
such that $r_u = i$ and $t + \tau - u \le x$. Define $n_i(k)$ as the number of
references to page $i$ in $r_1, \ldots, r_k$. The definition of $F(x)$ can be
expressed as:

\[
F(x) = \sum_{i=1}^{n} \lambda_i F_i(x)
     = \sum_{i=1}^{n} \lim_{k \to L} \frac{n_i(k)}{k} \cdot \frac{1}{n_i(k) - 1} \sum_{t=1}^{k-\tau} \beta_i(t, x)
     = \lim_{k \to L} \frac{1}{k} \sum_{t=1}^{k-\tau} \sum_{i=1}^{n} \beta_i(t, x),
\]

since $n_i(k) - 1 \approx n_i(k)$.

Define $\beta(t, x) = \sum_{i=1}^{n} \beta_i(t, x)$ and observe that $\beta(t, x) = 1 - \Delta(t, x)$.

Therefore,

\[ 1 - F(x) = \lim_{k \to L} \frac{1}{k} \sum_{t=1}^{k-\tau} \Delta(t, x) = m^*(x) - m_0^*/L. \]

This proves P5.
P6: $m^*(\tau + 1) - m^*(\tau) = -f(\tau + 1)$.

This follows immediately from P5.

P7:
\[
s^*(\tau) = \sum_{z=0}^{\tau-1} m^*(z)
          = \sum_{z=0}^{\tau-1} \left( 1 - F(z) + \frac{m_0^*}{L} \right)
          = \sum_{z=0}^{\tau-1} \sum_{y>z} f(y) + \frac{m_0^*\, \tau}{L}.
\]

This follows immediately from P5 and P6.

Since $m_0^* \tau / L$ is very small for large values of $L$ (i.e., for long
reference strings),

\[ s^*(\tau) \approx \sum_{z=0}^{\tau-1} \sum_{y>z} f(y). \]

Theorem 3.8:

a) $m(\text{PWS}, \tau) = 0$ if $\tau \ge T$,

b) $m^*(\text{PWS}, \tau) \approx m(\text{WS}, \tau)$,

c) $s^*(\text{PWS}, \tau) \approx s(\text{WS}, \tau)$,

d) $ST(\text{PWS}, \tau) \ll ST(\text{WS}, \tau)$.

Proof: Part a of the theorem has already been proved when we introduced
the algorithm PWS. Part b follows from the property P5 of PWS and the
property P4 of WS, given in section 2.4. Part c follows from the
property P7 of PWS and the property P6 of WS, given in section 2.4.
Finally, the last part can be shown as follows:

\[ ST(\text{PWS}, \tau) = \sum w^*(\tau) * (\delta t^*) \]

where $\delta t^*$ = average execution burst plus the average
page wait time plus the average ready queue delay, and
$w^*(\tau)$ is the average working set size during the
interval $\delta t^*$.

Similarly, $ST(\text{WS}, \tau) = \sum w(\tau) * \delta t$.

Now, approximately, $w^*(\tau) \approx w(\tau)$, and $\delta t^* \ll \delta t$: since we
have no page faults in PWS, $\delta t^* \approx$ average execution burst. Therefore,
we have the required result. □

The implication of Theorem 3.8 is that we have defined a paging
algorithm with zero page faults which pulls the same number of pages
as the WS algorithm and which requires the same average working set size
as WS. We have also proved that the memory utilization (ST product) of
PWS is highly superior to that of WS. The main drawback, however, is that
PWS is unrealizable. If partial information is available about future
working sets then we may be able to utilize it to improve performance
over WS and at least approach the performance of PWS. We feel that
further investigations should be made in this direction. We will add
some comments on this topic in Chapter 6.

We have ignored several points in the above discussion of PWS,
which we now discuss. Define the residency of a page in MM as the
fraction of time it is potentially available in MM. We assume that a
page in the working set is never replaced. Once a page has entered
$W^*(t, \tau)$, it will remain in MM for at least $\tau$ time units. Let $x$ be
the interreference interval for the page $i$. We have:

1) If $x \le \tau$, the page residency is 100%.

2) If $\tau < x$ then the following diagram gives us the page
residency.

[Diagram: time line for page i over the interreference interval x; the marked time points include t + T and t + x.]

If during the last residency the page was updated, then the
above diagram requires that $t + x - \tau > t + T$ in order for the page
to be released at time $t$. This means that page $i$ can leave memory
only if $x > T + \tau$. Otherwise (i.e., $\tau < x \le T + \tau$), if the
page were allowed to leave memory and called again after $x$ time units, then
the contents of page $i$ would be the old contents of page $i$ (i.e., before
the update during the last residency). Therefore, in such a case, the
page must be kept in MM and it will always be resident. This means
that more than $w^*(t, \tau)$ pages will occupy MM. To implement this in
practice, we assume that the reference string $\omega = r_1, r_2, \ldots, r_L$ is
such that $r_t = (y, k)$ where $y \in N$ is the current page referenced and $k$ is
the time to the next reference to $y$. The value of $k$ will be zero if page $y$
is not referenced again. The modified PWS, therefore, can be written
as follows:

MPWS:

(assume $\tau \ge T$)

$\forall t \ge 1$ let $r_{t-1} = (y, k)$ and $r_{t+\tau} = (y', k')$.

1) If $k = 0$ or $k > T + \tau$ then
       FREE(y)

2) If $y' \notin W^*(t, \tau)$ then PREPAGE(y').
END MPWS;
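The two rules of MPWS can be traced on a small reference string. The following is a sketch under assumptions rather than the thesis's implementation: time advances one unit per reference, the reference string is a plain list of page numbers from which the distances k are derived, FREE and PREPAGE are simply recorded as events, and the names `annotate` and `mpws_events` are hypothetical.

```python
def annotate(refs):
    """Pair each reference y with k, the distance to the next reference
    to y (0 if y is never referenced again)."""
    next_pos = {}
    out = [None] * len(refs)
    for idx in range(len(refs) - 1, -1, -1):
        y = refs[idx]
        k = next_pos[y] - idx if y in next_pos else 0
        out[idx] = (y, k)
        next_pos[y] = idx
    return out


def mpws_events(refs, tau, T):
    """List the (t, action, page) events issued by MPWS for window tau
    and page-fetch time T, with tau >= T as the algorithm assumes."""
    if tau < T:
        raise ValueError("MPWS assumes tau >= T")
    ann = annotate(refs)
    L = len(refs)
    events = []
    for t in range(1, L + 1):
        if t >= 2:
            y, k = ann[t - 2]                       # r_{t-1} = (y, k)
            if k == 0 or k > T + tau:
                events.append((t, "FREE", y))       # rule 1
        if t + tau <= L:
            y_next = refs[t + tau - 1]              # r_{t+tau}
            if y_next not in set(refs[t - 1 : t - 1 + tau]):
                events.append((t, "PREPAGE", y_next))   # rule 2
    return events
```

In the string `[1, 2, 1, 3]` with $\tau = T = 1$, page 1 is not freed after its first use because its next reference is only 2 time units away, which does not exceed $T + \tau$.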
4. IMPROVING LOCALITY OF ARRAY PROGRAMS

4.0 Introduction

Locality of programs can be improved in various ways, as noted
in  Chapter  2.   By  proper  organization  of  data  among  the  data  pages,  we 
can  reduce  the  size  of  the  localities  as  well  as  the  probability  of  an 
interlocality  transition.   This  has  been  termed  the  problem  of  data 
pagination.   For  matrix  algorithms  operating  on  large  arrays,  we  need 
to  consider  various  techniques  for  storing  them,  such  as,  row  major 
order,  column  major  order,  packed  row  storage,  packed  column  storage, 
or  submatrix  (block)  storage.   Several  investigations  have  indicated 
that  submatrix  storage  is  preferable  to  other  storage  schemes  [MCKE69, 
ROGE73].  We  shall  restrict  our  investigation  to  the  submatrix  type  of 
organization.   For  large  matrix  problems,  it  should  be  clear  that 
instruction  paging  is  dominated  by  data  paging;  for  this  reason  we 
ignore  instruction  paging. 

With  respect  to  improving  locality,  it  is  possible,  in  certain 
instances,  to  resequence  some  of  the  operations  in  a  program  so  that  we 
operate  on  the  data  that  is  already  resident  in  MM  rather  than  execute 
unnecessary  paging.   This  amounts  to  a  logical  reorganization  of  the 
code  as  against  pagination,  which  does  a  physical  reorganization.   In 
this  chapter,  we  will  include  a  discussion  of  some  of  these  methods  of 
logical reorganization for matrix programs. We will assume that these
modifications are done by the programmer. The problem of automating some
of these methods will be considered in Chapter 6.

Since we are interested in improving locality, we are also
interested in measuring these improvements. We feel that the average
working set size $s(\tau)$ as a function of $\tau$ is a very good indication of
the locality of a program. We also measure the number of page faults
using LRU, WS and MIN paging algorithms.
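Two of these measurements are easy to reproduce on a page reference trace. The sketch below is illustrative only: it takes the trace as a non-empty list of page numbers, uses the usual backward-looking working set (pages referenced in the last $\tau$ references, including the current one), and counts faults for LRU with c page frames. The WS and MIN fault counts are not covered, and the function names are hypothetical.

```python
from collections import OrderedDict


def average_working_set_size(refs, tau):
    """s(tau): mean number of distinct pages among the last tau references."""
    L = len(refs)
    total = 0
    for t in range(1, L + 1):
        total += len(set(refs[max(0, t - tau) : t]))
    return total / L


def lru_faults(refs, c):
    """Number of page faults under LRU replacement with c page frames."""
    frames = OrderedDict()              # least recently used page first
    faults = 0
    for page in refs:
        if page in frames:
            frames.move_to_end(page)    # refresh recency on a hit
        else:
            faults += 1
            if len(frames) >= c:
                frames.popitem(last=False)   # evict the least recently used page
            frames[page] = True
    return faults
```

A trace that cycles through three pages with only two frames faults on every reference under LRU, which is the kind of behaviour that better locality is meant to avoid.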

We  will  present  three  techniques  for  the  improvement  of  locality 
of  matrix  programs.  We  will  illustrate  their  use  by  applying  them  on 
several  common  matrix  algorithms.   We  will  show  that  an  order  of  magnitude 
improvement  in  locality  can  be  obtained  by  these  methods. 

We now consider the notation for storing matrices in VA space.
Let $A$ be a rectangular matrix of size $(n_1 \times n_2)$. The matrix will be
divided into square submatrices of order $m$ as shown in Figure 4.1. If
we assume that $p$, the page size, is a perfect square and that $m^2 = p$,
then each of the submatrices can be stored in one page. Note that in general there
will be some submatrices on the right and the bottom borders which will
not be full. We will assume that $n_1$ and $n_2$ are both multiples of $m$
and therefore such fragmentation will not occur. If the matrix $A$ is
a square matrix then $n_1 = n_2 = n$. We define the integers $N$, $N_1$ and $N_2$
through the equations $n = N \cdot m$, $n_1 = N_1 \cdot m$ and $n_2 = N_2 \cdot m$. Therefore, the
number of data pages occupied by an $n_1 \times n_2$ matrix is $N_1 \cdot N_2$, by an $n \times n$
matrix is $N^2$, and by a triangular matrix of order $n$ is $N(N+1)/2$.
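These page counts, and the page that holds a given element, follow directly from the block layout. The sketch below assumes that pages are numbered from 0 in block-row-major order, which the text does not specify, and uses hypothetical function names.

```python
def page_count(n1, n2, m):
    """Pages occupied by an n1 x n2 matrix stored as m x m submatrices."""
    if n1 % m or n2 % m:
        raise ValueError("n1 and n2 are assumed to be multiples of m")
    return (n1 // m) * (n2 // m)            # N1 * N2


def triangular_page_count(n, m):
    """Pages occupied by a triangular matrix of order n: N (N + 1) / 2."""
    N = n // m
    return N * (N + 1) // 2


def page_of(i, j, m, N2):
    """Page holding element (i, j), 1-based, for a matrix with N2 block columns.
    The block-row-major page numbering is an assumption of this sketch."""
    return ((i - 1) // m) * N2 + (j - 1) // m
```

For a matrix of order 24 with m = 4, this gives 36 pages for the full matrix and 21 for a triangular one.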

4.1 Cholesky Decomposition

We start with a particular matrix algorithm, namely, the Cholesky
factorization of a symmetric, positive definite matrix. We will try to
improve the locality of this algorithm in various ways. This will, then,
lead us to general methods of locality improvement. We will try these
methods on some other typical matrix algorithms.

Figure 4.1. Submatrix Storage

Cholesky decomposition factors a symmetric positive definite
matrix $A$ into the product $G \times G'$ where $G$ is a lower triangular matrix and
$G'$ denotes the transpose of $G$ [WILK71]. The algorithm is defined by
the following relations:

\[ g_{kk} = \sqrt{a_{kk} - \sum_{j=1}^{k-1} g_{kj}^2}, \]

\[ g_{ik} = \Bigl( a_{ik} - \sum_{j=1}^{k-1} g_{ij}\, g_{kj} \Bigr) / g_{kk}, \]

for $1 \le k \le n$ and $k < i \le n$,

where $A = [a_{ij}]$ is an $n \times n$ matrix, and $G = [g_{ij}]$ is a lower triangular
matrix of order $n$. We recall that the basic algorithm in a PL/I-like
language is as in Figure 4.2. Note that we operate only on the lower
triangular part of $A$ and we overwrite $G$ onto $A$. We have omitted all
declarations, input, output and such other statements from the program.
Note that, for our measurements, we will use a matrix ($A$) of order
$n = 24$ and a page size $p = 16$ ($\Rightarrow m = 4$). We note that the first inner loop,
'DO J = 1 TO k - 1', scans the k-th row of the matrix $A$. This is done
with each traversal of the major loop (with a different value of k).
Since several (in fact, $m$) rows are stored across in a row of pages, it
is clear that the locality will be improved if we alternately traverse
the rows in opposite directions. We observe that these traversals are
possible because the order of computation within the loop is immaterial.
By appropriately reversing the other two loops, we will improve the
locality in a similar way. The resulting program is labelled CDR, which
is given in Figure 4.3.

CD: /* CHOLESKY DECOMPOSITION */
DO k = 1 TO n;
  S = 0;
  DO J = 1 TO k - 1;          /* sum of squares along row k */
    S = S + A(k, J) * A(k, J);
  END;
  A(k, k) = SQRT(A(k, k) - S);
  DO I = 1 TO n - k;          /* rows below the diagonal */
    S = 0;
    DO J = 1 TO k - 1;
      S = S + A(k+I, J) * A(k, J);
    END;
    A(k+I, k) = (A(k+I, k) - S)/A(k, k);
  END;
END CD;

Figure 4.2. Cholesky Decomposition
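For readers who want to run the algorithm of Figure 4.2, a direct Python transcription is sketched below. It is an illustration rather than the thesis's program: it uses 0-based indices, stores the matrix as a list of lists, and, like CD, overwrites only the lower triangle. The function name is hypothetical.

```python
import math


def cholesky_in_place(A):
    """Overwrite the lower triangle of the symmetric positive definite
    matrix A (list of lists) with its Cholesky factor G, where A = G G'."""
    n = len(A)
    for k in range(n):
        # Diagonal element: subtract the squares already computed in row k.
        s = sum(A[k][j] * A[k][j] for j in range(k))
        A[k][k] = math.sqrt(A[k][k] - s)
        # Elements below the diagonal in column k.
        for i in range(k + 1, n):
            s = sum(A[i][j] * A[k][j] for j in range(k))
            A[i][k] = (A[i][k] - s) / A[k][k]
    return A
```

The strictly upper triangle is left untouched, so it still holds the original entries of A after the call.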
CDR:
DO k = 1 TO n;
  S = 0;
  /* scan row k left to right for odd k, right to left for even k */
  IF MOD(k, 2) = 1 THEN
    DO; JLL = 1; JHL = k - 1; JSTEP = 1; END;
  ELSE
    DO; JLL = k - 1; JHL = 1; JSTEP = -1; END;
  DO J = JLL TO JHL BY JSTEP;
    S = S + A(k, J) * A(k, J);
  END;
  A(k, k) = SQRT(A(k, k) - S);
  /* alternate the direction of the row loop as well */
  IF MOD(k, 2) = 1 THEN
    DO; IHL = n - k; ILL = 1; ISTEP = 1; END;
  ELSE
    DO; IHL = 1; ILL = n - k; ISTEP = -1; END;
  DO I = ILL TO IHL BY ISTEP;
    IF MOD(I, 2) = 1 THEN
      DO; JLL = 1; JHL = k - 1; JSTEP = 1; END;
    ELSE
      DO; JLL = k - 1; JHL = 1; JSTEP = -1; END;
    S = 0;
    DO J = JLL TO JHL BY JSTEP;
      S = S + A(k+I, J) * A(k, J);
    END;
    A(k+I, k) = (A(k+I, k) - S)/A(k, k);
  END;
END;
END CDR;

Figure 4.3. Cholesky Decomposition with Reversal



In Figure 4.4, we have plotted the average working set size $s(\tau)$
as a function of the window width $\tau$ for CD and CDR. The improvement of
locality due to the reversal technique is clearly shown.

The locality of a program, such as CD, may be improved more
significantly if we extend the algebra of the language to include
submatrix operations. We sketch the program using the OL/2 language
[PHIL72]. The names of all OL/2 programs will be prefixed by the letter
'O'. In Figure 4.5, we give the program OCD.

The order of the element by element computation of $(C - M \times R')$
is unspecified in OCD. Since a matrix-vector multiplication is specified
by $M \times R'$, we have the choice of implementing the multiplication by a
submatrix multiplication and thereby improving the locality. We will
briefly explain what we mean by submatrix multiplication. Assume we
are carrying out $Z = X * Y$, where $X$, $Y$ and $Z$ are matrices of order $n$.
We shall denote the $(I, J)$ submatrix of the matrix $X$ by the usual
subscript notation, e.g., $X_{IJ}$. Multiplication of $X$ and $Y$ can, therefore,
be written as follows:

\[ Z_{IJ} = \sum_{K=1}^{N} X_{IK} * Y_{KJ}, \qquad 1 \le I, J \le N. \]

Note that the sum and product in the above equation denote the matrix
sum and matrix product for matrices of order $m$. We apply this submatrix
operation to modify CD and obtain the results shown in Figure 4.7 for
the program CDM in Figure 4.6.
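Submatrix multiplication as defined by the equation above can be sketched as follows. The code is illustrative: it works on square matrices stored as lists of lists with 0-based indices, assumes the order n is a multiple of the block order m, and uses a hypothetical function name. Each (I, J, K) step touches only three m by m blocks, which is the source of the improved locality when each block occupies one page.

```python
def matmul_blocked(X, Y, m):
    """Compute Z = X * Y by m x m submatrices: Z_IJ = sum over K of X_IK * Y_KJ."""
    n = len(X)
    if n % m:
        raise ValueError("the matrix order is assumed to be a multiple of m")
    N = n // m
    Z = [[0.0] * n for _ in range(n)]
    for I in range(N):
        for J in range(N):
            for K in range(N):
                # Multiply block X_IK by block Y_KJ and accumulate into Z_IJ.
                for i in range(I * m, (I + 1) * m):
                    for j in range(J * m, (J + 1) * m):
                        s = 0.0
                        for k in range(K * m, (K + 1) * m):
                            s += X[i][k] * Y[k][j]
                        Z[i][j] += s
    return Z
```

The result is the same as that of the ordinary triple loop; only the order in which the elements are visited changes.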

We have plotted, in Figure 4.7, $s(\tau)$ vs. $\tau$ for CD and CDM.
We note that the improvement obtained by this method is much greater than
that obtained by the method of loop reversal.

Figure 4.4. Localities of CD and CDR

OCD: PROC(A, n);
  LET A BE A LOWER TRIANGULAR MATRIX OF ORDER (n);
  FOR K = 1, 2, ..., n; PARTITION A AFTER ROWS K - 1, K;
    SET R = A <2,1> ROW VECTOR, D = A <2,2> SCALAR,
        M = A <3,1>, C = A <3,2> COLUMN VECTOR;
    D = SQRT(D - (R, R'));
    C = (C - M X R')/D;
END OCD;

[Diagram: partition of A after rows K-1 and K.]

Figure 4.5. OL/2 Program for Cholesky Decomposition
CDM:
FM = FLOAT(M);
DO K = 1 TO n; K1FM = (K-1)/M; KCM = CEIL(K/FM);
  S = 0;
  DO J = 1 TO K - 1;
    S = S + A(K, J) * A(K, J);
  END;
  A(K, K) = SQRT(A(K, K) - S);
  /* rows in the same block row as K: full blocks to the left */
  DO J = 1 TO K1FM;
    DO I = K+1 TO KCM * M;
      S = 0;
      DO J1 = 1 TO M;
        S = S + A(I, (J-1) * M + J1) * A(K, (J-1) * M + J1);
      END;
      A(I, K) = A(I, K) - S;
    END;
  END;
  /* same rows: remaining partial block, then divide */
  DO I = K+1 TO KCM * M;
    S = 0;
    DO J = K1FM * M + 1 TO K - 1;
      S = S + A(I, J) * A(K, J);
    END;
    A(I, K) = (A(I, K) - S)/A(K, K);
  END;
  /* remaining block rows, one block at a time */
  DO I = KCM + 1 TO N;
    DO J = 1 TO K1FM;
      DO I1 = 1 TO M;
        S = 0;
        DO J1 = 1 TO M;
          S = S + A((I-1) * M + I1, (J-1) * M + J1) *
                  A(K, (J-1) * M + J1);
        END;
        A((I-1) * M + I1, K) = A((I-1) * M + I1, K) - S;
      END;
    END;
    DO I1 = 1 TO M;
      S = 0;
      DO J = K1FM * M + 1 TO K - 1;
        S = S + A((I-1) * M + I1, J) * A(K, J);
      END;
      A((I-1) * M + I1, K) = (A((I-1) * M + I1, K) - S)/A(K, K);
    END;
  END;
END;
END CDM;

Figure 4.6. Cholesky Decomposition with Submatrix Multiplication

Figure 4.7. Localities of CD and CDM
We now turn to an even better technique for improving the locality.
This requires the construction of a submatrix algorithm and we illustrate
it for the Cholesky factorization [HOUS64]:

\[ G_{II} = \text{CHOLDEC}\Bigl( A_{II} - \sum_{K=1}^{I-1} G_{IK} \times G_{IK}' \Bigr) \qquad \text{for } 1 \le I \le N, \]

\[ G_{JI} = \Bigl( A_{JI} - \sum_{K=1}^{I-1} G_{JK} \times G_{IK}' \Bigr) \times (G_{II}')^{-1} \qquad \text{for } I < J \le N. \]

We now present the OL/2 program for this algorithm. A PL/I
program, CDS, is given in Appendix C.

OCDS: PROC;
  FOR K = 1, 2, ..., N; PARTITION A AFTER
      ROWS m * (K-1), m * K;
    SET R = A <2,1>, D = A <2,2>,
        C = A <3,2>, M = A <3,1>;
    D = D - R X R';
    CALL OCD (D, m);
    C = (C - M X R') X (D')^-1;
END OCDS;

Note that $D$ is a lower triangular matrix of order $m$ and $R$ is a
rectangular matrix of size $m \times m(K-1)$. Similarly, $C$ and $M$ are
matrices of size $(N-K)m \times m$ and $(N-K)m \times m(K-1)$ respectively. To carry out the Cholesky
factorization of $D$, only one page is involved, and therefore any method
can be used without changing the performance. Therefore, we use OCD
for simplicity. Based on OCDS, we write CDS, given in Appendix C. We carry
out the multiplication $M \times R'$ by submatrices. We store the inverse of $D'$
in the strictly upper triangular part of $D$. This way, we will not be able
to store the diagonal elements of $(D')^{-1}$. Note that this storage
scheme is possible because the page containing the subarray $D$ has
precisely $m(m-1)/2$ words vacant. Also note that the diagonal elements
of $(D')^{-1}$ are the reciprocals of the corresponding diagonal elements of $D$.

In Figure 4.8, we have plotted $s(\tau)$ vs. $\tau$ for CD and CDS. The
large improvement in locality is quite obvious. We note that the
reversal technique can be applied to both CDM and CDS, obtaining CDMR
and CDSR respectively. Improvements thus gained are not significant
enough to report. In Figure 4.9, we have plotted $s(\tau)$ vs. $\tau$ for CD,
CDR, CDM and CDS. It can be easily seen that the versions of Cholesky
factorization in order of increasingly better locality are CD, CDR, CDM
and CDS.

In Figure 4.10, we have plotted the number of page faults, $\pi(\tau)$ vs.
$\tau$, for CD, CDR, CDM and CDS using the WS paging algorithm. In Figure 4.11
and Figure 4.12, we have plotted $\pi(c)$ vs. $c$ (the page allotment) for
CD, CDR, CDM and CDS using the LRU and MIN paging algorithms respectively.
It is clear from these figures that improved locality implies a
reduction in page faults.

4.2 Other Matrix Algorithms

From the example of Cholesky decomposition, we observe that there
are three general methods of locality improvement of matrix algorithms.
The first method is the method of loop reversal, in which we change the
direction of loop traversal each time we traverse a loop. We note that,
depending on the computation within the loop, this may not always be
possible. The second method is to carry out all matrix multiplications
occurring within the algorithm as submatrix multiplications. The third
method consists in using a submatrix algorithm from the beginning. We
note that, for several matrix algorithms, their submatrix versions already
exist in the literature. However, for eigenvalue problems, such submatrix
algorithms do not exist and probably will not be found in the future.
We believe that the three methods can be usefully applied to most matrix
algorithms. We will consider several more algorithms in this chapter.
We note that the conclusions regarding the performance improvements by
these methods hold for any matrix size and any page size, although we
have considered only one matrix size and one page size.

Figure 4.9. Localities of CD, CDR, CDM and CDS

Figure 4.10. Page Faults Using WS Algorithm

Figure 4.11. Page Faults Using LRU Algorithm

Figure 4.12. Page Faults Using MIN Algorithm

4.2.1 Matrix Multiplication

The three versions of matrix multiplication we consider are: MM
(basic matrix multiplication), MMR (matrix multiplication with loop
reversal), and MMS (submatrix multiplication).

In Figure 4.13, we plot $s(\tau)$ vs. $\tau$ for these three programs.

4.2.2 LU Decomposition

Given a nonsingular matrix $A$, it can be uniquely decomposed
into $L \times U$ where $L$ is a lower triangular matrix and $U$ is unit upper
triangular. The method is known as Crout LU decomposition, and can be
written as [ISAA66]:

for k = 1 to n do;
  for i = k to n do;
    $l_{ik} = a_{ik} - \sum_{j=1}^{k-1} l_{ij}\, u_{jk}$;
  end;
  for j = k+1 to n do;
    $u_{kj} = \bigl( a_{kj} - \sum_{i=1}^{k-1} l_{ki}\, u_{ij} \bigr) / l_{kk}$;
  end;
end;
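The Crout scheme above translates almost line for line into Python. The sketch below is an illustration with assumptions: it uses 0-based indices, returns L and U as separate lists of lists rather than storing them over A, and performs no pivoting, so it assumes every $l_{kk}$ encountered is nonzero. The function name is hypothetical.

```python
def crout_lu(A):
    """Crout LU decomposition without pivoting: A = L U with U unit upper triangular."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    U = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for k in range(n):
        # Column k of L, from the diagonal downwards.
        for i in range(k, n):
            L[i][k] = A[i][k] - sum(L[i][j] * U[j][k] for j in range(k))
        # Row k of U, to the right of the unit diagonal.
        for j in range(k + 1, n):
            U[k][j] = (A[k][j] - sum(L[k][i] * U[i][j] for i in range(k))) / L[k][k]
    return L, U
```

Multiplying the returned factors reproduces A up to rounding error.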
Figure 4.13. Localities of MM, MMR and MMS

A straightforward implementation of this will be called LU.
Appropriate loop reversal turns it into LUR. To obtain LUM, we must
identify matrix multiplications within the algorithm. Towards this
end, we write the algorithm in OL/2 and call it OLU. Note that we
store L, U over A.

OLU: PROC(A, n);
  LET A BE A MATRIX OF ORDER (n);
  FOR K = 1, 2, ..., n; PARTITION A AFTER ROW
      K-1 AND AFTER COLUMNS K-1 AND K;
    SET C = A <2,1>, B = A <1,3>, X = A <1,2>, Y = A <2,2>,
        M = A <2,3>;
    PARTITION C, Y, M AFTER ROW 1;
    SET R = C <1>, D = Y <1>, Z = M <1>;
    Y = Y - C * X;
    Z = (Z - R * B)/D;
END OLU;

[Diagram: partition of A after row K-1 and after columns K-1 and K, with the blocks B, R and D labelled.]
From OLU, we see that C * X is a matrix-vector product and R * B is a vector-matrix product. Coding these products as submatrix products (as in CDM), we obtain LUM (we omit the detailed coding here). The submatrix LU decomposition algorithm can be written as [ISAA66]:
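for K = 1 to N do;
    $(L_{KK}, U_{KK}) = \text{LU-DEC}(A_{KK} - \sum_{J=1}^{K-1} L_{KJ} U_{JK})$;
    for I = K + 1 to N do;
        $L_{IK} = (A_{IK} - \sum_{J=1}^{K-1} L_{IJ} U_{JK}) * U_{KK}^{-1}$;
    end;
    for J = K + 1 to N do;
        $U_{KJ} = L_{KK}^{-1} * (A_{KJ} - \sum_{I=1}^{K-1} L_{KI} U_{IJ})$;
    end;
end K;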

This can easily be programmed in OL/2 as well as PL/I, obtaining OLUS and LUS respectively. In both cases, we store L and U on the original storage of A. While programming LUS, we store the inverses of $L_{KK}$ and $U_{KK}$ on a scratch page (only one scratch page is needed). Once again, we do not give the detailed codes for these. In Figure 4.14, we have plotted s(T) vs. T for LU, LUR, LUM and LUS. We have used n = 24 (the order of the matrix A) and page size p = 16 (which implies m = 4). Thus the number of pages occupied by A is 6 × 6 = 36 (in general, $N^2$).

4.2.3  Gaussian Elimination

Gaussian  elimination  reduces  a  nonsingular  matrix  A  into  an 
upper  triangular  matrix.   Gaussian  elimination  with  coefficient  storage 
can  be  written  as  [MCKE69]  : 
for k = 1 to n - 1 do;
    for i = k + 1 to n do;
        $a_{ik} = a_{ik} / a_{kk}$;
    end;
    for j = k + 1 to n do;
        for i = k + 1 to n do;
            $a_{ij} = a_{ij} - a_{ik} * a_{kj}$;
        end;
    end;
end;

Figure 4.14. Localities of LU, LUR, LUM and LUS
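The elimination loop with coefficient storage can be sketched in Python as follows. This is an illustration only, not the GOS program: the function name is ours, the matrix is a list of rows modified in place, and no pivoting is done, so every pivot a_kk is assumed to be nonzero. After the call, the multipliers sit below the diagonal and the upper triangular factor sits on and above it.

```python
def gauss_coefficient_storage(a):
    """In-place Gaussian elimination; multipliers are stored below the diagonal."""
    n = len(a)
    for k in range(n - 1):
        # Store the multipliers a_ik / a_kk in column k, below the pivot.
        for i in range(k + 1, n):
            a[i][k] = a[i][k] / a[k][k]
        # Update the trailing submatrix column by column, as in the loop above.
        for j in range(k + 1, n):
            for i in range(k + 1, n):
                a[i][j] = a[i][j] - a[i][k] * a[k][j]
    return a
```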

A  straightforward  implementation  of  this  algorithm  is  called  GOS   (we  omit 
the  coding  here).      From  GOS  with  loop  reversal,   we  easily  obtain  GOSR 
(code  omitted  here).      To  see  the  matrix  multiplications  within  GOS,   we 
write  it  in  OL/2   and  call  it  OGOS. 


OGOS:      PROC(A, n);
LET A BE A MATRIX OF ORDER (n);
FOR K = 1, 2, ..., n-1; PARTITION A AFTER
ROWS K-1, K AND AFTER COLUMNS K-1, K;
SET D = A <2,2> SCALAR, R = A <2,3> ROW VECTOR,
C = A <3,2> COLUMN VECTOR, M = A <3,3>;
C = C/D;
M = (M - C * R);
END OGOS;


                K-1     K
        A <1,1>   A <1,2>   A <1,3>
K-1
        A <2,1>   D         R
K
        A <3,1>   C         M
By programming M = M - C * R by submatrices, we obtain GOSM (code omitted here). Submatrix Gaussian elimination with coefficient storage can be written as [HOUS64, MCKE69]:


for K = 1 to N do;
    $A_{KK} = gos(A_{KK}, m)$;
    for J = 1 to m do;
        for i = K*m+1 to n do;
            $a_{i,(K-1)*m+J} = a_{i,(K-1)*m+J} / a_{(K-1)*m+J,(K-1)*m+J}$;
        end;
        for i = K*m+1 to n do;
            for j = (K-1)*m+J+1 to (K-1)*m+m do;
                $a_{ij} = a_{ij} - a_{i,(K-1)*m+J} * a_{(K-1)*m+J,j}$;
            end;
        end;
        for i = (K-1)*m+J+1 to K*m do;
            for j = K*m+1 to n do;
                $a_{ij} = a_{ij} - a_{i,(K-1)*m+J} * a_{(K-1)*m+J,j}$;
            end;
        end;
    end;
    for I = K + 1 to N do;
        for J1 = K + 1 to N do;
            $A_{I,J1} = A_{I,J1} - A_{IK} * A_{K,J1}$;
        end;
    end;
end;


Based on this algorithm, we can write OGOSS and GOSS (code omitted here). We note that both GOSM and GOSS can be modified by the loop reversal technique. McKellar et al [MCKE69] present a submatrix Gaussian elimination algorithm which, in our terminology, is GOSSR. In Figure 4.15, we have plotted s(T) vs. T for GOS, GOSR, GOSM and GOSS. Once again, behavior similar to that of Cholesky factorization is observed.


Figure 4.15. Localities of GOS, GOSR, GOSM and GOSS (n = 24, m = 4, N = 6)
4.2.4  Gram-Schmidt Orthogonalization

Given a generally tall ($n_1 \times n_2$) matrix A of rank $n_2$ ($n_2 \le n_1$), its decomposition into A = B × R is known as the Gram-Schmidt orthogonalization method. Here B is an ($n_1 \times n_2$) matrix with orthonormalized columns and R is a nonsingular upper triangular matrix of order $n_2$ [SCHW73]. It can be written as:

for k = 1 to $n_2$ do;
    for j = 1 to k - 1 do;
        $r_{jk} = \sum_{i=1}^{n_1} b_{ij} * a_{ik}$;
        for i = 1 to $n_1$ do;
            $b_{ik} = a_{ik} - r_{jk} * b_{ij}$;
        end;
    end;
    $r_{kk} = \sqrt{\sum_{i=1}^{n_1} b_{ik}^2}$;
    for i = 1 to $n_1$ do;
        $b_{ik} = b_{ik} / r_{kk}$;
    end;
end;
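The orthogonalization loop can be sketched in Python as follows. This is an illustration only, not the ORTH program: the function name is ours, B is stored over A (so each r_jk is computed from the partially updated column k), and A is assumed to have full column rank so that no r_kk is zero.

```python
import math


def gram_schmidt(a):
    """Overwrite a (n1 x n2, full column rank) with B and return (B, R), A = B * R."""
    n1, n2 = len(a), len(a[0])
    r = [[0.0] * n2 for _ in range(n2)]
    for k in range(n2):
        for j in range(k):
            # r_jk: inner product of finished column j with the current column k.
            r[j][k] = sum(a[i][j] * a[i][k] for i in range(n1))
            for i in range(n1):
                a[i][k] -= r[j][k] * a[i][j]
        # Normalize column k.
        r[k][k] = math.sqrt(sum(a[i][k] ** 2 for i in range(n1)))
        for i in range(n1):
            a[i][k] /= r[k][k]
    return a, r
```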

A  straightforward  implementation  of  this  algorithm  is  called 
ORTH  (omitted  here).  We  store  B  over  A,  thus  destroying  the  original 
elements  of  A.   A  reversal  technique  applied  to  ORTH  turns  it  into 
ORTHR.   To  identify  the  matrix  multiplications  within  ORTH,  we  write  it 
in  OL/2  and  call  it  OORTH. 

OORTH:  PROC(A, R, n1, n2);
LET A BE A n1 X n2 MATRIX AND LET
R BE AN UPPER TRIANGULAR MATRIX OF ORDER (n2);
FOR K = 1, 2, ..., n2; PARTITION A, R AFTER
COLUMNS K-1, K;
SET Q = A <1>, AK = A <2>, RK = R <1,2>, D = R <2,2>;
RK = Q' * AK;
AK = AK - Q * RK;
D = (AK, AK);  AK = AK/D;
END OORTH;

[Diagram: A partitioned after columns K-1 and K, with Q to the left of column K and AK as column K.]



If we code the statements RK = Q' * AK and AK = AK - Q * RK by submatrices, we obtain ORTHM. The submatrix version of ORTH [SCHW73] is written in OL/2 and is called OORTHS. Note that $N_2 = n_2/m$ and $N_1 = n_1/m$.


OORTHS:  PROC,
LET A BE A n1 X n2 MATRIX AND R BE AN
UPPER TRIANGULAR MATRIX OF ORDER (n2);
FOR K = 1, 2, ..., N2; PARTITION A, R AFTER
COLUMNS (K-1) * m, K * m;
SET Q = A <2>, D = R <2,2>, M = R <2,3>, G = A <3>;
CALL OORTH (Q, D, n1, m);
M = Q' * G;
G = G - Q * M;
END OORTHS;
[Diagram: A and R partitioned after columns (K-1)*m and K*m.]


This can be easily programmed in PL/I and is called ORTHS. In Figure 4.16, we have plotted s(T) vs. T for ORTH, ORTHR, ORTHM and ORTHS. Once again, the reversal technique can be applied to ORTHM and ORTHS, but the improvement thus gained is not significant.


4.3  Conclusion

In all four algorithms considered so far, we have seen that the three methods of locality improvement work very well. The submatrix version of an algorithm improves the locality to the greatest extent. Using the submatrix multiplication technique for all internal multiplications is the second best technique. The reversal technique also improves locality, but by a much smaller degree than the other two techniques. We also notice that locality improving methods cannot reduce page faults for very large core allotments. The reason is that, with a large number of pages allotted, the working set of an ill-structured (poor locality) program can also be kept in MM. We have observed that locality improvement methods are effective when c < O(N), where O(N) denotes kN for some constant k. Thus the asymptotic value of page faults is unaffected by locality improvements. It is also clear that the asymptotic value of page faults cannot be changed by changing paging algorithms if we restrict ourselves to demand paging. [There is no replacement required at high core allotments.] In Chapter 5, we will see that the asymptotic value of the number of page faults can be reduced by devising prepaging algorithms.

Figure 4.16. Localities of ORTH, ORTHR, ORTHM and ORTHS ($n_1$ = 24, $n_2$ = 16, $N_1$ = 6, $N_2$ = 4)

At this point, we are in a position to compare the paging performance of the three decomposition algorithms used in the solution of linear systems (Cholesky decomposition, LU decomposition and Gaussian elimination). In Figure 4.17, we have plotted $\pi(c)$ vs. c for CD, LU and GOS using the LRU paging algorithm. The superiority of Cholesky decomposition is very clear. But, as is well known, it can only be used when the matrix A is symmetric and positive definite. It is also known that this algorithm has the minimum operation count and superior stability among these three decomposition algorithms [ISAA66]. When the given matrix A is not symmetric or is not positive definite and we have to choose between LU decomposition and Gaussian elimination, LU decomposition is preferable. LU decomposition incurs fewer operations and is also more stable [ISAA66], and from Figure 4.17 it is seen that LU decomposition is generally superior to Gaussian elimination in the number of page faults as well.
Figure 4.17. Page Faults for CD, LU and GOS ($\log_2 \pi(c)$ vs. c; paging algorithm used: LRU; n = 24, m = 4)
4.4  A Note About Measurement of Performance

We have combined the LRU stack algorithm of Mattson et al [MATT70], the WS statistics gathering algorithm of Gordon [GORD73] and the multivalued MIN algorithm of Belady [BELA74] into the LRU_WS_MIN algorithm. This algorithm obtains the LRU, WS and MIN statistics of a given reference string in one pass for all values of page allotment (up to the size of the address space) and for all values of the window width T (up to a fixed maximum). We give a PL/I program for this algorithm in Appendix A.
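To illustrate how one pass can yield statistics for all page allotments, the following is a minimal Python sketch of the LRU part only, using an LRU stack and stack distances. It is not the LRU_WS_MIN program of Appendix A: the function name and data layout are ours, and the WS and MIN statistics are not covered.

```python
def lru_faults_all_sizes(refs, max_pages):
    """One pass over refs; return {c: LRU page faults with c pages allotted}."""
    stack = []                       # LRU stack, most recently used page first
    hits_at = [0] * (max_pages + 1)  # hits_at[d]: references found at stack depth d
    for page in refs:
        if page in stack:
            depth = stack.index(page) + 1   # stack distance of this reference
            if depth <= max_pages:
                hits_at[depth] += 1
            stack.remove(page)
        stack.insert(0, page)               # referenced page moves to the top
    faults = {}
    hits = 0
    for c in range(1, max_pages + 1):
        hits += hits_at[c]                  # a reference hits iff its depth <= c
        faults[c] = len(refs) - hits
    return faults
```

For the reference string 1, 2, 3, 1, 2, 3 this gives 6 faults for c = 1 and c = 2 and 3 faults for c = 3, since every re-reference has stack distance 3.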

Once a simple one-pass paging algorithm simulation is available, the next question is how to easily obtain the reference string for a given program. For the matrix programs that we have considered, each program has only one trace and, therefore, there is only one reference string associated with it. Also note that we are only interested in data references and, in particular, only in array references. Given a matrix program written in PL/I, we remove all declarations. All assignment statements and expressions are reduced by removing all scalar references (and retaining only array references). Each array reference is now replaced by a call to the appropriate paging algorithm simulator (in our case LRU_WS_MIN). For each such call we must also supply the page number of the referenced page. This is determined from the name and index of the array referenced and the organization of the arrays in VA space. For example, if a matrix of order n is stored from page 1 to page $N^2$ by submatrices, and the submatrices are stored in column major order, then a reference A(i, j) will be a reference to the page number

$x = N (\lceil j/m \rceil - 1) + \lceil i/m \rceil$,

since the index of the submatrix to which the element A(i, j) belongs is $(\lceil i/m \rceil, \lceil j/m \rceil)$.
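The mapping from an array reference to a page number can be sketched in Python as follows. This is an illustration only: the function name is ours, indices are 1-based as in the PL/I programs, and we assume one m x m submatrix per page with the submatrices stored in column major order starting at page 1.

```python
def page_number(i, j, m, N):
    """Page holding A(i, j) when m x m submatrices are stored column-major, from page 1."""
    block_row = -(-i // m)   # ceil(i / m) using integer arithmetic
    block_col = -(-j // m)   # ceil(j / m)
    return N * (block_col - 1) + block_row
```

With n = 24 and m = 4 (so N = 6), A(1, 1) falls on page 1, A(5, 1) on page 2, A(1, 5) on page 7 and A(24, 24) on page 36.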
5.   APPLICATION OF FREEING AND PREPAGING TO ARRAY PROGRAMS


5.0  Introduction 

The objective of prepaging is to reduce or avoid page faults. In many array algorithms, which are dominated by data paging, we have some inherent structure which allows a foreknowledge of paging needs. This can be utilized in several ways. First, if we know that a set of data pages is not required by the program in the future, then the space occupied by these pages in MM can be freed for some other use. Second, if we have extra space in MM, then we may be able to prefetch some of the required pages. We have noted in Chapter 4 that the locality improvement methods reduce page faults only for small values of the page allotment c. We will see that the methods used in this chapter can help reduce page faults for larger values of c. Thus we will have covered the complete spectrum of c-values. We will show that an order of magnitude reduction in the asymptotic value of page faults can be achieved by our methods.

Once again, we start out with an example, namely, Cholesky decomposition (CD), and show how to improve its performance. We will also be concerned with measuring the improvement in the performance. Finally, in subsequent sections, we will apply the techniques to other matrix algorithms. In this chapter, we assume that the programmer is making these improvements in his program. In Chapter 6, we will discuss the question of automating these techniques.
5.1  Cholesky Decomposition

It is easier if we describe the algorithm using OL/2 and an associated diagram to identify the data-working-set. Let us consider the program OCD (refer to section 4.1) and the following partitioned array:


                k-1     k
        A <1,1>
k-1
        R         D
k
        M         C         A <3,3>

The algorithm OCD shows that, for a particular value of k, the subarrays R, D, M and C are referenced, but the subarrays A <1,1> and A <3,3> are not. Furthermore, we note that the elements of A <3,3> were not used in the past, but the present elements of A <3,3> will be referenced in the future. On the other hand, all the elements of A <1,1> were used in the past, so they are likely to be in MM, and the algorithm shows that they will not be used again. Thus A <1,1> may be marked as a dead subarray. Since the mechanism that we have only allows pages to be declared dead, it is necessary to modify our mechanism for dead subarrays. Because of dynamic partitioning, the extent of the subarrays varies dynamically. In this example, A <1,1> may share pages with the useful subarrays R, D, M and C. The set of shared pages is a null set whenever the condition (k-1) = 0 mod m is satisfied. This condition corresponds to the alignment of a page boundary with the bottom boundary of the subarray A <1,1> (given by the expression k-1). Under such a condition, we can free all the pages of A <1,1>. We assume the existence of a procedure FREE(b) which frees all the pages of the subarray b that are resident in MM. Based on this observation, OCD can be modified to yield OCDF as follows:
OCDF:   PROC(A, n);


IF MOD(k, m) = 1 THEN CALL FREE(A <1,1>);
D = SQRT(D - (R', R'));
C = (C - M X R')/D;
END OCDF;

A corresponding modification is made in the PL/I program CD (refer to section 4.1) to obtain CDF. We assume the existence of a procedure FREE(A, I, J), which declares the (I, J) submatrix of array A dead. We note that this freeing operation is carried out each time k - 1 = 0 mod m. Therefore, we need only free the bottom-most row of pages of A <1,1>. In Figure 5.1, we give the PL/I program CDF.

In Figure 5.2, we have plotted $\pi(c)$ vs. c for CD using the LRU and MIN paging algorithms and for CDF using the FREELRU paging algorithm. We see that at and around point D, FREELRU behaves almost like MIN (the optimal demand paging algorithm). The page allotment for point D corresponds to the expression $\max_{k \in [1,n]} |P(R, D, M, C)|$, where $P(A_1, A_2, \ldots)$ denotes the pages containing the subarrays $A_1, A_2, \ldots$.
CDF:
DO k = 1 TO n;  K1PM = (k-1)/m;
   IF MOD(k, m) = 1 THEN
      DO J = 1 TO K1PM;
         CALL FREE(A, K1PM, J);
      END;
   S = 0;
   DO J = 1 TO k-1;
      S = S + A(k, J) * A(k, J);
   END;
   A(k, k) = SQRT(A(k, k) - S);
   DO I = 1 TO n-k;
      S = 0;
      DO J = 1 TO k-1;
         S = S + A(k+I, J) * A(k, J);
      END;
      A(k+I, k) = (A(k+I, k) - S)/A(k, k);
   END;
END CDF;


Figure 5.1.   Cholesky Decomposition with Freeing
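For readers who wish to experiment, the structure of CDF can be sketched in Python as follows. This is an illustration only: the function name is ours, the matrix is a list of rows whose lower triangle is overwritten by the factor, and the `free` argument is a hypothetical callback standing in for the FREE procedure (here it only receives the submatrix indices I and J).

```python
import math


def cholesky_with_freeing(a, m, free):
    """In-place Cholesky on the lower triangle of a; calls free(I, J) on dead submatrices."""
    n = len(a)
    for k in range(1, n + 1):                  # k is 1-based, as in the listing
        if k % m == 1:
            block_row = (k - 1) // m           # bottom-most page row of A<1,1>
            for J in range(1, block_row + 1):
                free(block_row, J)
        # Diagonal element: a_kk = sqrt(a_kk - sum_j a_kj^2)
        s = sum(a[k - 1][j] ** 2 for j in range(k - 1))
        a[k - 1][k - 1] = math.sqrt(a[k - 1][k - 1] - s)
        # Column k below the diagonal.
        for i in range(k, n):
            s = sum(a[i][j] * a[k - 1][j] for j in range(k - 1))
            a[i][k - 1] = (a[i][k - 1] - s) / a[k - 1][k - 1]
    return a
```

Passing a callback that appends its arguments to a list is a simple way to record which submatrices would be declared dead, and at which point in the computation.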
Figure 5.2. Page Faults Using LRU, FREELRU and MIN Algorithms ($\log_2 \pi(c)$ vs. c; n = 24, m = 4, N = 6)
|P| denotes the number of pages in the set P. Clearly, this is the same as the maximum number of active pages during the execution of the program CD. This is easily found to be equal to $\lfloor (N+1)^2/4 \rfloor$, which in the case N = 6 reduces to 12. We note that the asymptotic value of page faults, i.e., $\pi(\infty)$, is not changed by freeing. Note that $\pi(\infty)$ = total number of distinct pages referenced = size of the address space (in case each page is referenced) = N(N+1)/2 (= 21 in the present case). We note that for very low values of c, freeing does not reduce page faults because dead pages are removed anyway by the replacement action of LRU. Only when c ≥ O(N) does freeing start to help. For very high values of c (i.e., O($N^2$)), once again, freeing does not help since the whole of the VA space can be stored in MM and no page replacements are necessary. Because of simulation time limits we have not carried out our experiments with large values of the matrix size. It is hoped that with a large value of n (the matrix size) the advantages of freeing will be brought out more clearly.
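The two page counts quoted above for n = 24 and m = 4 (12 active pages at most, 21 pages in the address space) can be checked by direct enumeration. The following Python sketch is an illustration only: the function name is ours, and we assume that only the lower triangle of A is stored, one m x m submatrix per page.

```python
def active_pages(k, n, m):
    """Pages, as (block row, block column) pairs, touched by R, D, M and C at step k."""
    pages = set()
    for i in range(k, n + 1):        # row k holds R and D; the rows below hold M and C
        for j in range(1, k + 1):    # columns 1..k-1 hold R and M; column k holds D and C
            pages.add((-(-i // m), -(-j // m)))
    return pages


n, m = 24, 4
N = n // m
max_active = max(len(active_pages(k, n, m)) for k in range(1, n + 1))
address_space = N * (N + 1) // 2     # pages in the lower triangle of an N x N page grid
```

Under these assumptions, `max_active` evaluates to 12 and `address_space` to 21, in agreement with the values given in the text.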

We  note  that  the  freeing  technique  can  be  applied  to  other 
variations  of  Cholesky  decomposition  (e.g.,  CDR,  CDM,  CDS).   Since  the 
nature  of  improvements  is  similar,  we  do  not  discuss  these  further. 

Next we proceed to investigate what can be prepaged in OCD. We know that the four useful subarrays are R, C, M and D. Since the extent of these subarrays changes with the loop control variable k, we will indicate this by subscripting these subarrays. Note that $R(k+1) \subset M(k) \cup C(k)$ and $M(k+1) \subset M(k) \cup C(k)$. Therefore, if we decide to look only one loop execution ahead, we need to prepage the subarrays C and D only. We note that P(C(k+1)) = P(C(k)) if k ≠ 0 mod m. Here P(B) denotes the pages containing the subarray B. Similarly, P(D(k+1)) = P(D(k)) if k ≠ 0 mod m. The reason for this can be seen from the following diagram:


[Diagram: rows k and k+1 of A drawn against the page boundaries (m rows per page), showing R(k), D(k), C(k) and R(k+1), D(k+1), C(k+1).]


Therefore,  we  need  to  prepage  only  when  the  condition  k  =  0  mod  m  is 
satisfied.  Under  such  a  condition  we  prepage  C(k+l)  and  D(k+l). 

If we decide to prepage the newly required elements of C and D for each new execution of the loop just before beginning the execution of the loop, then we will prepage C(k) and D(k) if k = 1 mod m. Since we are going to consider only demand prepaging algorithms, it does not matter when we prepage a certain set of pages, so long as the prepaging is done prior to a reference to any one of these pages. Assume that we have issued a prepage statement for a set of pages P at time t. Assume further that at some time t' > t the reference $r_{t'} = i \in P$ is made and that at t' we have a page fault. At this time page i will be fetched and as many pages of P will be prefetched as will be allowed by the number of empty pages in MM. With this modification, we obtain the following algorithm:
OCDFP:   PROC(A, n, m);


IF MOD(k, m) = 1 THEN
DO;  CALL FREE(A <1,1>);  CALL PREPAGE(C, D);
END;
D = SQRT(D - (R', R'));
C = (C - M X R')/D;
END OCDFP;
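The demand prepaging rule described above (fetch the faulting page, then prefetch as many pages of the prepaged set as the empty frames allow) can be sketched as a small simulator in Python. This is an illustration only and not the paging algorithm used for the measurements: the class and method names are ours, all outstanding PREPAGE requests are kept in a single list, and prefetched pages are treated as most recently used, which is an assumption on our part.

```python
from collections import OrderedDict


class DemandPrepagingLRU:
    """Hypothetical simulator: LRU replacement plus FREE and demand PREPAGE."""

    def __init__(self, frames):
        self.frames = frames
        self.resident = OrderedDict()  # keys are resident pages, least recently used first
        self.pending = []              # pages named in PREPAGE calls, not yet fetched
        self.faults = 0
        self.pulls = 0                 # pages brought into MM (fetched or prefetched)

    def prepage(self, pages):
        # Demand prepaging: only record the request; nothing is fetched yet.
        for p in pages:
            if p not in self.resident and p not in self.pending:
                self.pending.append(p)

    def free(self, pages):
        # Pages declared dead give up their frames immediately.
        for p in pages:
            self.resident.pop(p, None)

    def reference(self, page):
        if page in self.resident:
            self.resident.move_to_end(page)    # hit: page becomes most recently used
            return
        self.faults += 1
        was_pending = page in self.pending
        if was_pending:
            self.pending.remove(page)
        if len(self.resident) >= self.frames:
            self.resident.popitem(last=False)  # evict the least recently used page
        self.resident[page] = None
        self.pulls += 1
        # On a fault to a prepaged page, fill the remaining empty frames from the set.
        while was_pending and self.pending and len(self.resident) < self.frames:
            self.resident[self.pending.pop(0)] = None
            self.pulls += 1
```

With four frames, prepaging pages 1, 2 and 3 and then referencing them in order costs one page fault but three page pulls, which is why faults and pulls must be counted separately once prepaging is introduced.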

We can also modify CDF to obtain CDFP in a similar way. In Figure 5.3, we have plotted the $\pi(c)$ vs. c curve for CD using the LRU paging algorithm, for CDFP using the FREEDPRE4LRU paging algorithm and for CD using the DPMIN paging algorithm. We see that prepaging using FREEDPRE4LRU (in fact, demand prepaging) reduces the page faulting considerably for large values of c. In particular, the asymptotic value of $\pi$ has been brought down from N(N+1)/2 to N. This is a reduction of one order of magnitude. We see that the asymptotic value of $\pi$ for DPMIN is 1. We note here that such a prepaging technique can also be applied to other variations of Cholesky factorization (namely, CDR, CDM and CDS), but we omit this here, since the results are very similar to those of CD.

Since with prepaging the number of page pulls is different from the number of page faults, we need to consider these two measures separately. In Figure 5.4, we have plotted the number of page pulls vs. c for CD using the LRU, MIN and DPMIN paging algorithms and for CDFP using the FREEDPRE4LRU paging algorithm. We see that the number of page pulls for FREEDPRE4LRU is generally lower than the number of page pulls for LRU, and only rarely does it go above, and then only by a very small amount. We see that the page pulls for DPMIN and MIN are generally equal except in some cases where DPMIN incurs more page pulls than MIN. Note that the asymptotic value of page pulls is the same for all cases and is equal to the size of the data address space (= N(N+1)/2).

Figure 5.3. Page Faults Using LRU, FREEDPRE4LRU, MIN and DPMIN Paging Algorithms (n = 24, m = 4, N = 6)

Figure 5.4. Page Pulls Using LRU, FREEDPRE4LRU, MIN and DPMIN Paging Algorithms ($\log_2$(page pulls) vs. c)

We can view the process of freeing and prepaging done in CD in a slightly different way. Let us denote the data-working-set of one loop execution of CD by DWS(k), where k is the loop control index. In words, DWS(k) is the set of pages referenced in the k-th execution of the loop. DWS(k) can be recursively defined as follows:

a)  DWS(1) = P(C(1)) U P(D(1))

b)  If k = 1 mod m then
        DWS(k) = DWS(k-1) U P(D(k)) U P(C(k)) - P(A<1,1>(k) - A<1,1>(k-1))
               = DWS(k-1) U P(D(k)) U P(C(k)) - P(R(k-1)) - P(D(k-1))
    else
        DWS(k) = DWS(k-1);

This knowledge of DWS allows us to do the appropriate freeing and prepaging.

5.2  Other  Examples 
5.2.1  LU  Decomposition 

For LU decomposition (refer to OLU and the associated diagram in section 4.2), we have

a)  DWS(1) = P(Y(1)) U P(Z(1));

b)  If k = 1 mod m then
        DWS(k) = DWS(k-1) U P(Y(k)) U P(Z(k)) - P(A<1,1>(k-1))
               = DWS(k-1) U P(Y(k)) U P(Z(k)) - P(R(k-1)) - P(X(k-1)) - P(D(k-1));
    else
        DWS(k) = DWS(k-1);

Based on this observation, LU and OLU can be modified to free A<1,1> and prepage Y and Z. We do not present the detailed programs LUF, LUFP, OLUF and OLUFP. In Figure 5.5, we have plotted $\pi(c)$ vs. c for LU using the LRU, MIN and DPMIN paging algorithms and for LUF using the FREEDPRE4LRU paging algorithm. We see that freeing and prepaging help only for larger values of c and that the asymptotic value of $\pi(c)$ has been brought down from $N^2$ to N.

5.2.2  Gaussian  Elimination 

For Gaussian elimination (referring to OGOS and its associated diagram in section 4.2), we have,

a)  DWS(1) = P(A); i.e., the whole array A.

b)  If k = 1 mod m then
        DWS(k) = DWS(k-1) - P(A<1,1> U A<1,2> U A<1,3> U A<2,1> U A<3,1>)
               = DWS(k-1) - P(C(k-1)) - P(D(k-1)) - P(R(k-1));
    else
        DWS(k) = DWS(k-1);
Figure 5.5. Effect of Prepaging on LU Decomposition

From these equations, we see that the only prepaging that can be done is outside (before) the loop. And since the whole of the array A is to be prepaged, only a limited number of pages, as allowed by the available space, will really be prefetched. Freeing of the subarrays A<1,1>, A<1,2>, A<1,3>, A<2,1> and A<3,1> can be done within the loop when the condition k = 1 mod m is met. With these modifications we have GOSF, GOSFP, OGOSF and OGOSFP (we omit the corresponding programs). In Figure 5.6, we have plotted $\pi(c)$ vs. c for GOS using the LRU, MIN and DPMIN paging algorithms, for GOSF using the FREELRU paging algorithm and for GOSFP using the FREEDPRE4LRU paging algorithm. We note that $\pi$(FREELRU) = $\pi$(LRU). In other words, freeing does not help in this algorithm. The reason for this is that the dead page is precisely the least recently used page. Thus the page replaced by FREELRU and LRU is the same at any time. The same effect is not necessarily observed with other paging algorithms. For example, if we were using FIFO instead of LRU, then in all probability $\pi$(FREEFIFO) < $\pi$(FIFO) for GOS. We also note that prepaging and freeing do help and reduce the asymptotic value of $\pi(c)$ from $N^2$ to 1.

5.2.3  Gram-Schmidt  Orthogonalization 

For Gram-Schmidt orthogonalization (referring to OORTH and its associated diagram in section 4.2), we have,

a)  DWS(1) = P(D(1)) U P(RK(1)) U P(AK(1));

b)  If k = 1 mod m then
        DWS(k) = DWS(k-1) U P(AK(k)) U P(RK(k)) U P(D(k)) - P(R<1,1>(k))
               = DWS(k-1) U P(AK(k)) U P(RK(k) U D(k)) - P(RK(k-1) U D(k-1));
    else
        DWS(k) = DWS(k-1);

From these equations, we see that, when the condition k = 1 mod m is met (within the loop just after the backward branch target), we can free the subarray R<1,1> and prepage the subarrays AK, RK and D. With appropriate modifications, we obtain ORTHF, ORTHFP, OORTHF and OORTHFP. In Figure 5.7, we have plotted $\pi(c)$ vs. c for ORTH using the LRU, MIN and DPMIN paging algorithms, for ORTHF using the FREELRU paging algorithm and for ORTHFP using the FREEDPRE4LRU paging algorithm. We see, once again, that freeing and prepaging help for larger values of c and that they bring down the asymptotic value of $\pi(c)$ from ($N_1 N_2 + N_2(N_2+1)/2$) to $N_2$.

Figure 5.6. Effect of Prepaging on Gaussian Elimination

5.2.4  Matrix Multiplication

For  matrix  multiplication,  consider  the  following  program  OMM: 
OMM:
LET A, B, C BE MATRICES OF ORDER (n);
FOR k = 1, 2, ..., n-1; PARTITION A, B AFTER
ROWS k-1, k;
A<2> = B<2> * C;
END OMM;


[Diagram: A and B partitioned after rows k-1 and k; B<2> is row k of B.]

Figure 5.7. Effect of Prepaging on Orthonormalization


[Note that the above can easily be programmed in OL/2 as the primitive A = B * C, but we have written OMM for obtaining DWS.] From the above, we have,

a)  DWS(1) = P(C) U P(B<2>(1)) U P(A<2>(1));

b)  If k = 1 mod m then
        DWS(k) = DWS(k-1) - P(B<1>(k)) - P(A<1>(k)) U P(B<2>(k)) U P(A<2>(k))
               = DWS(k-1) - P(B<2>(k-1)) - P(A<2>(k-1)) U P(B<2>(k)) U P(A<2>(k));
    else
        DWS(k) = DWS(k-1);
Based on this, we must prepage the whole array C outside (before) the loop body, and inside the loop we can free subarrays B<1> and A<1> and prepage subarrays B<2> and A<2> when the loop control variable satisfies the condition k = 1 mod m. With appropriate modifications, we obtain MMF and MMFP. In Figure 5.8, we have plotted $\pi(c)$ vs. c for MM using the LRU, MIN and DPMIN paging algorithms, for MMF using the FREELRU paging algorithm and for MMFP using the FREEDPRE4LRU paging algorithm. Once again, we see that for large values of c, freeing and prepaging help and that they reduce the asymptotic value of $\pi(c)$ from $3N^2$ to N.

We have seen that our methods of freeing and prepaging improve performance for large values of c (> O(N)) and that the methods work for many different matrix algorithms. The method owes its success to our ability to identify the data-working-sets of the programs. This identification is possible because of the well structured data-referencing of matrix algorithms. In Chapter 6, we will investigate whether the identification of data-working-sets can be done by an optimizing compiler.

Figure 5.8. Effect of Prepaging on Matrix Multiplication


5.3  Combined  Effect  of  Locality  Improvement  and 
Working-Set  Identification  Methods 

We  have  noted  that  the  locality  improvement  methods  of  Chapter  k 
improve  paging  performance  for  low  values  of  c  (<6(N))  and  the  methods 
of  this  chapter  improve  paging  performance  for  higher  values  of  c  (>£(N)), 
In  this  section  we  will  combine  the  two  methods  and  measure  the  resultant 
improvements . 

The version of Cholesky factorization without any improvements is
CD and the version with all improvements is CDSRFP. We give a PL/I
program for the latter in Appendix C. In Figure 5.9, we have plotted
π(c) vs. c for CD using the LRU paging algorithm and for CDSRFP using
the FREEDFREULRU paging algorithm. The vast improvement in performance
needs no elaboration. We have also plotted the number of page pulls
vs. c for CDSRFP.

In Figures 5.10, 5.11, 5.12 and 5.13 we draw similar curves for
LU decomposition, Gaussian elimination, Gram-Schmidt orthogonalization
and matrix multiplication respectively. Fully improved versions of
these algorithms are also presented in Appendix C.

Figure 5.9. Page Faults for CD and CDSRFP
n = 24, m = 4, N = 6

Figure 5.10. Page Faults for LU and LUSRFP

Figure 5.11. Page Faults for GOS and GOSSRFP

Figure 5.12. Page Faults for ORTH and ORTHSRFP

Figure 5.13. Page Faults for MM and MMSRFP
(legend: page faults for MM using the LRU paging algorithm)


6. AUTOMATION OF PERFORMANCE IMPROVEMENT TECHNIQUES

6.0  Introduction 

We have shown, in Chapter 4, how to improve the locality of matrix
algorithms, and in Chapter 5, we have shown how to include freeing and
prepaging techniques. We have seen that these techniques improve the
paging performance of matrix algorithms. However, in the previous chapters,
we have implicitly assumed that the programmer has the responsibility to
modify his program to obtain the improved paging performance. In general,
programmers are neither able nor willing to make these changes; nor is it
desirable to distract the user from his main purpose, namely, to solve his
problem. Therefore, we will consider how one is to automate these
techniques by incorporating them into the computer system.

The methods that we have presented clearly require advance
knowledge of the program's reference pattern, and the operating
system does not have sufficient knowledge of the reference string. An
optimizing compiler, on the other hand, may be able to gather enough
information about the reference pattern and may be able to make the
necessary changes to obtain the improved paging performance. In section
6.1, we will sketch several transformations of program segments which
improve the locality of the segment, and we will give simple conditions
for these transformations to be valid. In section 6.2, we will present
an algorithm which will abstract information from an OL/2 program and
then, later, use this information to do freeing and prepaging for the
program. It is clear that a similar algorithm can be given for an
element (PL/I-like) source language, but the amount of information to
be processed will be so large that the overhead incurred becomes
prohibitive.

6.1  Transformations  Which  Improve  Locality 

6.1.1  Loop  Reversal 

First we start out with an element language like PL/I. Consider a
program segment P as follows:

P:  DO I = 1 TO n1;
      ...
      DO J = 1 TO n2;
        S1
        S2
      END;
      ...
    END P;

Now consider a transformation T of the program segment P into the program
segment P'.

P': DO I = 1 TO n1;
      ...
      IF MOD(I,2) = 1 THEN
        DO; JLL=1; JHL=n2; JSTEP=1; END;
      ELSE
        DO; JLL=n2; JHL=1; JSTEP=-1; END;
      DO J=JLL TO JHL BY JSTEP;
        S1
        S2
      END;
      ...
    END P';

We will say that the transformation T is valid if the program
segments P and P' are equivalent (in the input-output sense) [ALLE71]. In
case T is valid, we will say that the inner loop of P is reversible. We
want to obtain (easily testable) sufficient conditions under which T is
valid.
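
The benefit that motivates the transformation T can be illustrated with a small simulation. The following Python sketch is only an illustration under stated assumptions: the helper names `lru_faults` and `inner_loop_pages`, the page size, and the frame count are hypothetical, and the inner loop is assumed to touch one vector element per value of J. It counts LRU page faults when the same vector is scanned forward on every pass (as in P) and when the scan direction alternates (as in P').

```python
from collections import OrderedDict

def lru_faults(page_refs, frames):
    """Count page faults of a page reference string under LRU with `frames` frames."""
    resident = OrderedDict()              # keys ordered from least to most recently used
    faults = 0
    for page in page_refs:
        if page in resident:
            resident.move_to_end(page)    # hit: mark as most recently used
        else:
            faults += 1
            if len(resident) >= frames:
                resident.popitem(last=False)   # evict the least recently used page
            resident[page] = True
    return faults

def inner_loop_pages(n1, n2, page_size, alternate):
    """Pages touched when element J of a vector is referenced for J = 1..n2, n1 times."""
    refs = []
    for i in range(1, n1 + 1):
        js = range(1, n2 + 1)
        if alternate and i % 2 == 0:      # reversed direction on every second pass
            js = range(n2, 0, -1)
        refs.extend((j - 1) // page_size for j in js)
    return refs

forward = lru_faults(inner_loop_pages(4, 16, 4, alternate=False), frames=3)
zigzag = lru_faults(inner_loop_pages(4, 16, 4, alternate=True), frames=3)
print(forward, zigzag)
```

With four pages of data and only three frames, the forward-only scan faults on every page of every pass, whereas the alternating scan reuses the pages left resident by the previous pass.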

Let us denote the sequence of statements within the inner DO loop
(of P) by F(I,J) where the arguments indicate the values of the two loop
control variables during the execution of the statement sequence F.
Clearly, P and P' are equivalent, provided the following segments X and X'
are equivalent.

X:   F(I,1)  F(I,2)  ...  F(I,n2)

X':  F(I,n2)  F(I,n2-1)  ...  F(I,1)

A sufficient condition for X and X' to be equivalent is:

F(I,j) commutes with F(I,k) for
∀ I ∈ [1,n1] and ∀ j, k ∈ [1,n2], j ≠ k.

Bernstein [BERN66] has derived sufficient conditions for commutativity of
program segments. Note that we need only partial commutativity here,
i.e., we require that the sequence F(I,j) F(I,k) be equivalent to
F(I,k) F(I,j) for ∀ k > j, ∀ I. We will repeat his conditions here.
Let us denote the set of addresses fetched and not stored during the
execution of F(I,j) by W(I,j), the set of addresses stored and not fetched
by X(I,j), the set of addresses fetched first and stored later by Y(I,j)
and the set of addresses stored first and fetched later by Z(I,j). Let
us denote by W_R, X_R, Y_R, Z_R the corresponding sets of addresses for the
rest of the program segment P excluding the inner DO loop. Then, sufficient
conditions for P and P' to be equivalent are:

1)  (W(I,k) ∪ Y(I,k)) ∩ (X(I,j) ∪ Y(I,j) ∪ Z(I,j)) = ∅

2)  (W(I,j) ∪ Y(I,j)) ∩ (X(I,k) ∪ Y(I,k) ∪ Z(I,k)) = ∅

3)  (X(I,k) ∪ Y(I,k) ∪ Z(I,k)) ∩
    (X(I,j) ∪ Y(I,j) ∪ Z(I,j)) ∩ (W_R ∪ Y_R) = ∅

Note that these are simple conditions which are easy to test.

However, these conditions exclude the following case:

P1: DO I = 1 TO n1;
      S = 0;
      DO J = 1 TO n2;
        S = S + A(I,J) * B(I,J);
      END;
    END;

This case represents an inner product being calculated within the
inner loop and as such is one of the most frequently occurring constructs
in matrix algorithms. For the program segment P1, we have,

W(I,J) = {A(I,J), B(I,J)},
X(I,J) = ∅,  Y(I,J) = {S},
Z(I,J) = ∅.

Therefore, Y(I,j) ∩ Y(I,k) = {S} ≠ ∅, which implies that conditions 1) and 2)
are violated. But it is clear that the inner loop of P1 is reversible.
This shows that the conditions we have are sufficient but not necessary.
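
How mechanical the test of conditions 1) to 3) is can be seen from the following Python sketch. It is an illustration only: the function name `reversible_by_conditions`, the representation of addresses as tuples and strings, and the parameter `rest_reads` (standing for the union of the fetch sets of the rest of the program) are assumptions made here, not part of any compiler described in the text.

```python
def reversible_by_conditions(W, X, Y, Z, rest_reads):
    """Return True if conditions 1)-3) hold for every pair j < k of inner-loop indices.

    W, X, Y, Z map each inner-loop index to a set of addresses, for one fixed
    value of the outer index. rest_reads plays the role of W_R | Y_R.
    """
    indices = sorted(W)
    for j in indices:
        for k in indices:
            if k <= j:
                continue
            stored_j = X[j] | Y[j] | Z[j]
            stored_k = X[k] | Y[k] | Z[k]
            if (W[k] | Y[k]) & stored_j:            # condition 1 violated
                return False
            if (W[j] | Y[j]) & stored_k:            # condition 2 violated
                return False
            if stored_k & stored_j & rest_reads:    # condition 3 violated
                return False
    return True

n2 = 4
empty = {j: set() for j in range(1, n2 + 1)}

# Inner product S = S + A(I,J) * B(I,J): every iteration fetches and then stores S.
W_dot = {j: {("A", j), ("B", j)} for j in range(1, n2 + 1)}
Y_dot = {j: {"S"} for j in range(1, n2 + 1)}
dot_ok = reversible_by_conditions(W_dot, empty, Y_dot, empty, rest_reads={"S"})

# Element-wise statement C(J) = A(J) + B(J): each iteration stores a distinct address.
X_add = {j: {("C", j)} for j in range(1, n2 + 1)}
add_ok = reversible_by_conditions(W_dot, X_add, empty, empty, rest_reads=set())
print(dot_ok, add_ok)
```

The inner product is rejected by the test even though it is reversible, which is exactly the gap between sufficiency and necessity noted above.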


We will require that this construct be specially recognized as reversible.
In fact, the following generalization is valid. Suppose we have the
following program segment:

DO I = 1 TO n1;
  DO J = 1 TO n2;
    S = S □ E(J);
  END;
END;

Assume that □ is a commutative and associative operator and E(J) is some
expression; then the inner J loop is reversible.
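
A small experiment makes the role of the two algebraic properties concrete. In this Python sketch the helper `reduce_loop` and the sample data are hypothetical; integer operands are used deliberately, since floating-point addition is only approximately associative and a reversed loop may then round differently.

```python
from functools import reduce
import operator

def reduce_loop(op, initial, terms, reverse=False):
    """Model of the loop S = S op E(J), run for J ascending or J descending."""
    order = list(reversed(terms)) if reverse else list(terms)
    return reduce(op, order, initial)

E = [3, 1, 4, 1, 5, 9, 2, 6]

# Commutative and associative operators: direction of the J loop does not matter.
same_results = all(
    reduce_loop(op, s0, E) == reduce_loop(op, s0, E, reverse=True)
    for op, s0 in ((operator.add, 0), (operator.mul, 1), (max, 0))
)

# String concatenation is associative but not commutative: direction matters.
forward = reduce_loop(operator.add, "", ["a", "b", "c"])
backward = reduce_loop(operator.add, "", ["a", "b", "c"], reverse=True)
print(same_results, forward, backward)
```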

Therefore, our loop reversal technique is a modification of Bernstein's
technique. Furthermore, it is generally profitable to reverse adjacent
loops in opposite directions. We now apply this automated procedure to
CD and see whether we can obtain CDR, the transformed procedure of
section 4.1.

The two inner J loops are immediately recognized as reversible
by the special construct. For the loop 'DO I = 1 TO n-K', we have:

W(K,I) = {A(K+I,1), ..., A(K+I,K-1),
          A(K,1), ..., A(K,K-1),
          A(K,K)},
X(K,I) = ∅,
Y(K,I) = {A(K+I,K)},
Z(K,I) = {S},

⇒ W(K,I) ∪ Y(K,I) = {A(K+I,1:K), A(K,1:K)},
  X(K,J) ∪ Y(K,J) ∪ Z(K,J) = {A(K+J,K), S}.

Clearly, the intersection is empty.

W_R = {A(K+1:n, 1:n-1)},
X_R = ∅,  Y_R = {A(K+1,K+1), ..., A(n,n)},
Z_R = {S}.

The triple intersection is {S} ∩ (W_R ∪ Y_R) = ∅. Thus, the reversibility of
the loop is established. By hand simulation, we have verified that our
automated loop reversing procedure will indeed translate CD into CDR. In
a similar fashion, we can show this for other algorithms.

We now turn our attention to programs written in OL/2. Quite
clearly, the techniques presented for an element source language are
applicable. We have, in addition, several more possibilities of reversal.
Consider a vector assignment statement within a loop in an OL/2 program.
The vector assignment statement will be compiled into a DO loop by the OL/2
compiler. If this loop is detected to be reversible, then the compiler
can carry out the loop reversal. The conditions for reversibility are
simpler in this case and are as follows (assume that the statement is a
vector expression being assigned to a vector C):

1)  Either the right side of the assignment statement does
    not contain the vector C, its ancestors, or its subarrays in
    the ACB tree [PHIL72];

2)  Or the assignment statement is given by C = C + Exp
    where Exp is a vector expression satisfying condition 1.

If either of these conditions is satisfied then the vector assignment is
reversible. Clearly, this procedure will translate OCD into CDR.

6.1.2 Loop Decomposition

We have not discussed this transformation in Chapter 4. Loop
decomposition is also known as loop unswitching [ALLE71]. Consider the
following program segment P:

P:  DO I = 1 TO n;
      A(I) = A(I) * A(I);
      B(I) = B(I) + I;
    END;
    ...
    rest of the program

Now consider the following transformation P' of P:

P': DO I = 1 TO n;
      A(I) = A(I) * A(I);
    END;
    DO I = 1 TO n;
      B(I) = B(I) + I;
    END;
    ...
    rest of the program


It is easily seen that the locality of P' is much better than that
of P. In other words, whenever a single loop operates on two or more
independent data streams, it is better to decouple these operations
into two or more DO loops. Sufficient conditions for the transformation
to be valid can be stated as follows [BERN66]:

a)  (W1(I) ∪ Y1(I)) ∩ (X2(J) ∪ Y2(J) ∪ Z2(J)) = ∅,

b)  (W2(J) ∪ Y2(J)) ∩ (X1(I) ∪ Y1(I) ∪ Z1(I)) = ∅,

c)  (X1(I) ∪ Z1(I)) ∩ (X2(J) ∪ Z2(J)) ∩ (W_R ∪ Y_R) = ∅,

for 1 ≤ J < I ≤ n, where the subscripts 1 and 2 refer to the first and
second statements of the loop body.
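
The locality argument for loop decomposition can be checked with a toy page-fault count. The Python sketch below is illustrative only; the helper names, the page size of four elements, and the extreme allotment of a single page frame are assumptions chosen to make the effect visible, and each statement is counted as one reference to the page holding its operand.

```python
from collections import OrderedDict

def lru_faults(page_refs, frames):
    """Count page faults of a page reference string under LRU with `frames` frames."""
    resident = OrderedDict()
    faults = 0
    for page in page_refs:
        if page in resident:
            resident.move_to_end(page)
        else:
            faults += 1
            if len(resident) >= frames:
                resident.popitem(last=False)   # evict least recently used page
            resident[page] = True
    return faults

def fused_refs(n, page_size):
    """Reference string of P: A(I) and B(I) are touched alternately."""
    refs = []
    for i in range(n):
        refs.append(("A", i // page_size))     # A(I) = A(I) * A(I)
        refs.append(("B", i // page_size))     # B(I) = B(I) + I
    return refs

def split_refs(n, page_size):
    """Reference string of P': all of A is processed first, then all of B."""
    a_refs = [("A", i // page_size) for i in range(n)]
    b_refs = [("B", i // page_size) for i in range(n)]
    return a_refs + b_refs

fused = lru_faults(fused_refs(8, 4), frames=1)
split = lru_faults(split_refs(8, 4), frames=1)
print(fused, split)
```

With one frame, the fused loop faults on every reference because the two data streams keep displacing each other, while the decomposed loops fault only once per page.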

6.1.3 Submatrix-Multiplication Transformation

We saw in Chapter 4 that many matrix algorithms contain matrix
multiplications. It is, therefore, profitable to convert these internal
matrix multiplications into submatrix multiplications.

In an array language like OL/2, the programmer is allowed to write
a matrix multiplication as a primitive operation. In such a case, the
compiler can easily compile this primitive into a submatrix multiplication.
Thus OCD will be compiled into CDM. If we also apply the procedure of
section 6.1.1, then OCD will be compiled into CDMR with a correspondingly
high degree of locality. The same statement holds for other matrix
algorithms.
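
What compiling the primitive into a submatrix multiplication amounts to can be sketched as follows. This Python fragment is a minimal illustration, not the code generated by any OL/2 compiler; the function names are hypothetical, the matrices are assumed square, and the submatrix size m is assumed to divide the matrix order.

```python
def matmul(a, b):
    """Ordinary element-by-element matrix product, used here as the reference."""
    n, p, q = len(a), len(b), len(b[0])
    return [[sum(a[i][k] * b[k][j] for k in range(p)) for j in range(q)]
            for i in range(n)]

def block_matmul(a, b, m):
    """Product of two n x n matrices computed submatrix by submatrix (m divides n)."""
    n = len(a)
    c = [[0] * n for _ in range(n)]
    for bi in range(0, n, m):
        for bj in range(0, n, m):
            for bk in range(0, n, m):
                # C[bi, bj] = C[bi, bj] + A[bi, bk] * B[bk, bj] on m x m submatrices
                for i in range(bi, bi + m):
                    for j in range(bj, bj + m):
                        c[i][j] += sum(a[i][k] * b[k][j]
                                       for k in range(bk, bk + m))
    return c

A = [[i * 4 + j for j in range(4)] for i in range(4)]
B = [[(i + 1) * (j + 2) for j in range(4)] for i in range(4)]
C_blocked = block_matmul(A, B, 2)
print(C_blocked == matmul(A, B))
```

Each innermost step touches only three m x m submatrices, which is what gives the submatrix form its locality when a submatrix fits in a page.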

For an element source language, this is more difficult since the
compiler has to detect a sequence of statements which corresponds to
matrix multiplication or its variants. A number of papers have appeared
on the topic of automatic recognition of vector and matrix operations for
the purpose of exploiting parallelism [SCHN72]. The same techniques
are applicable for our purposes.

6.1.4 Conversion From a Non-Submatrix to a Submatrix Algorithm

In general, it is impossible to automate a conversion from a
non-submatrix algorithm like CD to a submatrix algorithm like CDS. The
transformation procedure would be required not only to understand the
logic of the program, but also to have knowledge or 'intelligence' about
the theory of matrix algebra. We will illustrate the difficulties
involved in such a task by examples.

First we establish submatrix analogs (also known as block analogs) of
scalar operations. For example, the submatrix analogs of scalar addition
and scalar multiplication are matrix addition and matrix multiplication
respectively. When setting up the submatrix analog of division, we have
a problem. Assume that in the non-submatrix version of a certain matrix
program, the expression E/A(K,K) occurs. A possible submatrix analog
of this expression could be E_s * (A_{I,I})^{-1}, where A_{I,I} is the (I,I)
submatrix of A, I = ⌈K/m⌉, and E_s is the submatrix analog of E. Another
possible analog of the expression is (A_{I,I})^{-1} * E_s. Clearly, these two
expressions produce different effects, because matrix multiplication is
not commutative, and only one of them is the correct analog. It is not
at all clear from the original program which of the
two possible analogs is correct. This particular example occurs in the
transformation of CD into CDS (refer to section 4.1).
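
That the two candidate analogs of division really differ is easy to confirm numerically. The Python sketch below uses an arbitrary 2 x 2 example; the names `A_II` and `E_s` merely stand in for the diagonal submatrix and the submatrix analog of E, and the sample values are made up for illustration.

```python
from fractions import Fraction

def matmul2(p, q):
    """Product of two 2 x 2 matrices."""
    return [[sum(p[i][k] * q[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def inverse2(a):
    """Exact inverse of a nonsingular 2 x 2 matrix."""
    det = Fraction(a[0][0] * a[1][1] - a[0][1] * a[1][0])
    return [[a[1][1] / det, -a[0][1] / det],
            [-a[1][0] / det, a[0][0] / det]]

A_II = [[2, 1], [1, 1]]      # plays the role of the (I,I) submatrix of A
E_s = [[1, 2], [3, 4]]       # plays the role of the submatrix analog of E

right = matmul2(E_s, inverse2(A_II))   # analog with the inverse on the right
left = matmul2(inverse2(A_II), E_s)    # analog with the inverse on the left
print(right, left)
```

Both expressions reduce to E/A(K,K) when the submatrices are 1 x 1, which is why the scalar program alone cannot reveal which one is intended.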

If  we  consider  Cholesky  decomposition,  then  the  submatrix  analog 
of  the  square  root  operation  is  the  Cholesky  decomposition  of  a  submatrix. 
There  is  no  way  to  automate  this  type  of  knowledge  at  present. 

Another example occurs in attempting to transform ORTH into ORTHS
(refer to section 4.2). The transformation is possible because of the
known properties of orthonormal vectors [SCHO73; SCHW73]. Again, we cannot
incorporate this knowledge in a compiler.

In the case of eigenvalue problems, submatrix algorithms may not even
exist. The basic problem is the noncommutative property of matrix
multiplication [IDVA72].

6.2  Freeing  and  Prepaging 

In this section, we will discuss the automation of freeing and
prepaging for matrix algorithms written in the OL/2 language. It should be
clear that we can do the same kind of optimization for any other source
language. However, for an element source language the overhead incurred
by our procedure is likely to be intolerable.

The optimizer will be a module of the OL/2 compiler. It will
collect some information about the program during the syntactic phase and
later use this information to do efficient freeing and prepaging for the
program. Simply stated, the job of the optimizer consists of two parts:
(a) prepage the subarrays and the arrays needed by the program at the
proper time during the execution of the program and (b) free memory space
occupied by the dead subarrays and dead arrays as soon as possible. We
will use (sub)array to denote "subarray or array." Ideally, the proper
time to prepage is such that any page of the (sub)array is
set up in MM before the first reference to it, and 'as soon as possible'
is taken to mean 'just after the last use of the corresponding page'.
Such an ideal situation will, clearly, be equivalent to the use of the PWS
paging algorithm and result in zero page faults as well as very economical
use of memory. Such an ideal system, however, will be very expensive in
terms of overhead. We consider a practical solution.

It is clear that some sort of a list of useful (sub)arrays needs
to be prepared during the syntax analysis phase. This list can be used
in two possible ways. First, it may be used to insert instructions for
freeing and prepaging (sub)arrays during the coding phase. Second, it
may be used dynamically, to prepage (sub)arrays from the list as far into
the future as demanded by the desired performance. Clearly, the second
approach requires variable page allotment of MM. It will also incur
more overhead since the list must be kept in MM during the execution
phase, but it is likely to result in an improved paging performance. We
will discuss the implementation of the first approach, since it is
simpler; however, the technique of constructing the list is common to both
approaches.

Let us assume, for the sake of argument, that the program can
acquire as many pages as necessary. Furthermore, assume that once a
(sub)array has been brought into MM it remains there until after the last
time it is used. Define the lifetime of a (sub)array A as follows:
L(A) = (t1, t2) where t1 is the time of the first use and t2 the time of
the last use of A. Once the lifetime of each (sub)array is determined,
our problem is solved, at least in theory, since we can follow the
following prepaging and freeing strategy:

Given L(A) = (t1, t2), if t < t1 - T then PREPAGE(A); and if
t > t2 then FREE(A). The variable t is the current process time and
T is the average page-fetch time.
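
The strategy can be turned into a simple schedule of paging events. The Python sketch below is one possible reading, under assumptions that are not in the text: process time is taken to be an integer, the prepage request is issued at the latest admissible moment (T before the first use, but never before time 0), and FREE is issued at the first instant after the last use. The function name and the sample lifetimes are hypothetical.

```python
def paging_schedule(lifetimes, fetch_time):
    """Build a time-ordered list of (time, action, name) events.

    lifetimes maps a (sub)array name to its lifetime (t1, t2);
    fetch_time plays the role of T, the average page-fetch time.
    """
    events = []
    for name, (t1, t2) in lifetimes.items():
        events.append((max(0, t1 - fetch_time), "PREPAGE", name))  # ready by t1
        events.append((t2 + 1, "FREE", name))                      # first t > t2
    return sorted(events)

schedule = paging_schedule({"A": (5, 9), "B": (2, 20)}, fetch_time=3)
for event in schedule:
    print(event)
```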

We construct, therefore, the list USE_LIST whose nodes correspond
to the (sub)arrays in the order in which they are referenced by the program.
On the program level the normal sequential ordering is used, and within a
statement, right end order of the expression tree is followed [JURI72].
The primary link in the USE_LIST is the one implied by the above ordering.
We will refer to this link by the names 'static link' and the NEXT pointer.
In case of repeated use of a (sub)array, it is not worthwhile to enter
each use in the USE_LIST, since the USE_LIST will become very long.
Instead, we consider only the first and the last use, and the interval
between them has been defined as the lifetime. The decision made here is
quite arbitrary. We could have, alternatively, defined the lifetime L' of
(sub)array A as:

L'(A) = L(A) ∩ SS(A)

where L(A) is the lifetime by the earlier definition and SS(A) is the
syntactic scope of A (as in a block-structured language like OL/2). Note
that SS(A), in general, is a set of disjoint intervals and since L(A) is
a single interval, L'(A) will be a set of disjoint intervals. The use of L'
will involve more overhead but will result in better paging
performance. There are many other ways of defining lifetime, but we will
use the first definition, namely, L. We have to store L(A) for each A
in the USE_LIST. A practical way to do this is to have a node for A
corresponding to the first applied occurrence of A and store a lifetime
pointer, called LP, in the node, pointing to the last applied occurrence
of A.
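
The alternative definition of lifetime is a plain interval intersection, as the following Python sketch shows. It is illustrative only; the function name, the representation of intervals as closed pairs of integer times, and the sample values are assumptions.

```python
def restricted_lifetime(lifetime, scope_intervals):
    """Intersect the single interval L(A) with the disjoint intervals of SS(A)."""
    t1, t2 = lifetime
    pieces = []
    for lo, hi in sorted(scope_intervals):
        start, end = max(t1, lo), min(t2, hi)
        if start <= end:                  # keep only non-empty intersections
            pieces.append((start, end))
    return pieces

# L(A) = (4, 30); A is syntactically in scope during three disjoint intervals.
pieces = restricted_lifetime((4, 30), [(0, 10), (18, 22), (40, 50)])
print(pieces)
```

The result is a list of intervals rather than one interval, which is the source of the extra bookkeeping overhead mentioned above.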

We have tacitly assumed that the sequence of uses of (sub)arrays
in a program can be linearized. There are several language constructs
that prohibit this.

The GOTO statement introduces nonlinearity in the USE_LIST. Since
the programming language that we are using (OL/2) is GOTO-less, we do not
have this problem.

Another source of nonlinearity is the IF statement. Let us assume
that the statement I = (IF E1 THEN S1 ELSE S2) occurs in a program.
Ideally, the USE_LIST for this statement should be:

                 T
               ----> U(S1)
U(I) = U(E1) <
               ----> U(S2)
                 F

where U(X) denotes the USE_LIST for the program segment X. Since the
branch actually taken by the program is unknown until execution time,
we have an obvious problem. The IF statement makes the predictability of
the future reference pattern poor. Therefore, we have to settle for
approximate solutions. There are various solutions to this problem.

The first solution is to linearize U(I) as U(E1) - U(S1) - U(S2).
This solution implies that, independent of the branch taken by the program,
the (sub)arrays in both S1 and S2 will be considered part of the
data-working-set and will be prepaged. This has two implications: first,
potentially useless pages are brought into MM, incurring extra page pulls
and wasting MM. Second, if U(S1) is large so that only part of it can be
prepaged when the if-expression E1 is evaluated, and if the 'false' branch
is taken, then a good part of U(S2) will have to be demand paged.

The second approach assumes that U(I) = U(E1). In other words, we
assume that U(S1) = U(S2) = null. This solution implies that, depending
on the branch taken by the program, U(S1) or U(S2) will have to be demand
paged, but no memory space is wasted and no extra channel traffic occurs.

The third approach allows the USE_LIST to be nonlinear and assumes
that

                          T
                        ----> U(S1)
U(I) = U(E1) - [STOP] <
                        ----> U(S2)
                          F

This solution implies that all prepaging activity be stopped after the
evaluation of the if-expression.

Which  of  these  solutions  is  the  most  useful  can  only  be  decided 
by  experience  with  a  working  system.  We  select  the  first  solution  for 
the  rest  of  the  discussion. 
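
The three treatments of the IF statement can be compared side by side in a few lines. The Python sketch below is illustrative only: use-lists are modelled as plain lists of (sub)array names, and the function name, the "STOP" marker, and the dictionary used for the two branches are assumptions made for this example.

```python
def linearize_if(u_e, u_s1, u_s2, approach):
    """Return the use-list for I = (IF E1 THEN S1 ELSE S2) under one of three approaches."""
    if approach == 1:
        # Both branches are treated as part of the data-working-set.
        return u_e + u_s1 + u_s2
    if approach == 2:
        # Both branches are treated as empty and left to demand paging.
        return list(u_e)
    if approach == 3:
        # Prepaging stops after E1; the branch lists are kept but not linearized.
        return u_e + [("STOP", {"T": list(u_s1), "F": list(u_s2)})]
    raise ValueError("approach must be 1, 2 or 3")

u_e, u_s1, u_s2 = ["X"], ["A", "B"], ["C"]
first = linearize_if(u_e, u_s1, u_s2, 1)
second = linearize_if(u_e, u_s1, u_s2, 2)
third = linearize_if(u_e, u_s1, u_s2, 3)
print(first, second, third)
```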

The existence of loops (e.g., FOR loops, WHILE loops) is still
another source of nonlinearity. One possible way of linearizing, in this
case, is to unfold the loops. But this solution is unattractive because
it is not applicable to WHILE loops, since the number of times the loop is
traversed is not known at compile time, and it is wasteful of memory space,
since a lot of information is duplicated in the USE_LIST. The alternative
is to allow loops in the USE_LIST.

The  fourth  source  of  nonlinearity  in  the  USE_LIST  is  the  call  to 
a  procedure.  We  assume  that  each  procedure  has  code  inserted  in  it  to 
do  its  own  freeing  and  prepaging.   Thus,  in  the  calling  procedure,  we  stop 
prepaging  at  and  beyond  the  point  of  call  and  resume  freeing  and  prepaging 
after  a  return  from  the  called  procedure.   In  our  treatment,  we  have 
ignored  call  statements  altogether  but  the  above  solution  may  easily  be 
incorporated. 

Assume that the USE_LIST has been constructed for a given program.
Each (sub)array in the USE_LIST may have at most four boundaries. Now
since OL/2 allows dynamic partitioning, these boundaries can move during
the execution of the program. Therefore, each boundary has associated with
it a boundary expression [PHIL72]. If all the boundary expressions of a
(sub)array remain constant during the execution of a loop, then such a
(sub)array will be called static (with respect to the loop). Those
subarrays for which at least one of the boundary expressions varies within
the loop are called dynamic subarrays. Clearly, paging techniques for
these two types of (sub)arrays should be different. The static (sub)arrays
need be prepaged only once, whereas the extents of the dynamic subarrays
change during the execution of the loop and hence parts of these subarrays
may have to be prepaged during the loop execution. We therefore decide
to originate a dynamic link at the beginning of a FOR loop; it will
thread all subarrays within the scope of the FOR loop which are dynamic
with respect to this loop. If the loop control variable is k, then the
dynamic link will also be called the k-chain. We define a predicate P such
that for a subarray B and the loop control variable k, P(B,k) = T (true)
iff B is dynamic with respect to loop k.
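
One simple way to evaluate the predicate P is to inspect the boundary expressions for occurrences of the loop control variable. The Python sketch below does this with the standard `ast` module. It is a simplification and an assumption of this example: boundary expressions are written as Python expression strings, and an expression is taken to vary within the loop exactly when it mentions the control variable (variables assigned inside the loop body are ignored). The names `variables` and `is_dynamic` are hypothetical.

```python
import ast

def variables(expr):
    """Names of the variables occurring in a boundary expression."""
    tree = ast.parse(expr, mode="eval")
    return {node.id for node in ast.walk(tree) if isinstance(node, ast.Name)}

def is_dynamic(boundaries, k):
    """P(B,k): true iff some boundary expression of B mentions control variable k."""
    return any(k in variables(expr) for expr in boundaries)

# Subarray with rows k+1..n and columns 1..k, and the whole array (rows and columns 1..n).
moving = ("k + 1", "n", "1", "k")
whole = ("1", "n", "1", "n")
print(is_dynamic(moving, "k"), is_dynamic(whole, "k"))
```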

Once the USE_LIST has been constructed, as outlined above,
prepaging instructions can be inserted during the coding phase as follows.

At nest level ℓ = 0, we prepage (sub)arrays in the USE_LIST,
traversing it along the static link. At nest level ℓ > 0, prior to
generating code for the loop, we insert instructions for prepaging
(sub)arrays, traversing the USE_LIST along the static link; and within
the loop, we insert instructions for prepaging dynamic subarrays,
traversing the USE_LIST along the dynamic link.

We will now discuss how to insert the lifetime pointers (LP) or,
equivalently, how to determine L(A) for each A in the USE_LIST. We
assume the existence of a table of (sub)array names (addressed by the
(sub)array name) containing two pointers, IP and CP, for each entry.
IP, the initial pointer, points to the first occurrence of the (sub)array
in the USE_LIST, and CP, the current pointer, points to the node in the
USE_LIST where the currently last use of the (sub)array would have been
listed if we were to list each use of the (sub)array. At the end of
the syntax analysis phase, LP = CP for each (sub)array.
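
The bookkeeping with IP and CP can be sketched as a single pass over the references. In the Python fragment below, pointers are modelled as integer positions in the reference sequence, and the node layout and the function name `build_use_list` are assumptions of this example rather than the data structure of an actual compiler.

```python
def build_use_list(references):
    """Build the USE_LIST and the name table from (sub)array names in reference order."""
    use_list = []   # nodes in static (NEXT) order, one per (sub)array
    table = {}      # name -> {"IP": index of its node, "CP": position of latest use}
    for position, name in enumerate(references):
        if name not in table:
            # First applied occurrence: create the node and set the initial pointer.
            table[name] = {"IP": len(use_list), "CP": position}
            use_list.append({"name": name, "first": position, "LP": None})
        else:
            # Repeated use: no new node, only the current pointer advances.
            table[name]["CP"] = position
    for entry in table.values():
        # End of syntax analysis: the lifetime pointer is the final current pointer.
        use_list[entry["IP"]]["LP"] = entry["CP"]
    return use_list, table

use_list, table = build_use_list(["A", "B", "A", "C", "B", "A"])
for node in use_list:
    print(node)
```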

How and when to free (sub)arrays is the topic to be discussed next.
A simple answer to this question is to free a (sub)array soon after the
end of its lifetime. First of all, this implies the existence of a node,
called the end-of-lifetime node, for each (sub)array in the USE_LIST.
Whenever such a node is scanned during the traversal of the USE_LIST in the
coding phase, instructions can be inserted to free the corresponding
(sub)array. However, there are several problems with this procedure.
If the lifetime L(A) of a (sub)array A is completely contained within a
loop f, then the above procedure will make us prepage and free A during
every traversal of the loop. To alleviate this difficulty, we make the
following change in our procedure. If at nest level ℓ = 0 then we free A
on scanning the end-of-lifetime node of A; and if at nest level ℓ > 0
then we free A, on scanning the end-of-lifetime node of A, only during
the last traversal of the loop. Now if A is a dynamic subarray then this
refined procedure also breaks down. For such a subarray, the contents of A
during the last iteration of a loop may be significantly different from
the contents of A during the previous iterations. The procedure we have
outlined thus far may leave a lot of dead elements in MM. A simple
scheme immediately comes to mind to solve this problem. At the expiration
of the lifetime of A within a loop, free only the dead elements of A,
i.e., if we let A(k) be the elements of A when the loop control variable
takes the value k, then at the expiration of the lifetime of A during the
k-th loop traversal free (A(k+1) - A(k)). The danger in this procedure is that
these freed elements may be the useful elements of an adjacent subarray
during the next loop traversal. We solve this problem in the following
way. Define the scope of an array A by

S(A) = (min_{B ⊂ A} IP(B), max_{B ⊂ A} LP(B))
where B ⊂ A means B is a subarray of A. Intuitively, S(A) is an interval
specifying the first use and the last use of A or any of its subarrays.
The scope S(A) of array A defines a natural universe over which we can
define the dead subarrays of A. Define the DEAD_LIST as follows:

DEAD_LIST(A, S(A)) = A - (A ∩ USE_LIST)
                       - {B | B ⊂ A, B is static with respect to all
                          loops in which it is referenced}.

The expression (A ∩ USE_LIST) denotes the list of all subarrays of A that
occur in the USE_LIST. Clearly, B ∈ DEAD_LIST implies that B is dynamic
with respect to at least one loop. Now it is easy to merge the DEAD_LIST
with the USE_LIST. For each array A, we do the following: for each
B ∈ DEAD_LIST(A, S(A)), and for each loop (with control variable k) in
which B occurs, if P(B,k) = T, we link B to the dynamic chain of the
loop. It is clearly helpful to keep all dead subarrays at the head of
the dynamic chain, after which all the useful dynamic subarrays follow.
The procedure of DEAD_LIST construction and linking to the USE_LIST can only
be done in a post-syntax, pre-coding phase. Now the code for freeing and
prepaging dynamic subarrays is easily inserted in the loop by traversing
the dynamic chain of the loop.

Thus our description of the procedure to construct the USE_LIST is
reasonably complete. However, we have ignored one important problem,
namely, for B ⊂ A, L(A) ∩ L(B) ≠ ∅, i.e., the possibility that the
lifetimes of a subarray B and (sub)array A may overlap. We solve this
problem by considering several cases separately.

If L(B) ⊂ L(A) then we delete B from the USE_LIST. Since we have
assumed that the (sub)array A is resident in MM throughout its lifetime,
its subarrays need not be considered separately.
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If  L(B)  c  L(A)  but  IP(B)  <  IP(A)  and  LP(B)  <  LP(a)  then  we  extend 
the  lifetime  of  B  so  that  LP(B)  =  LP(A).   The  reason  for  this  is  that  when 
the  lifetime  of  B  has  ended,  but  the  lifetime  of  A  has.  not,  B  cannot  be 
freed  anyway. 

If $LP(B) > LP(A)$ then we extend the lifetime of A so that
$LP(A) = LP(B)$. If this simple approach is not used, then an alternative
approach has to find the complement of B with respect to A and free all the
subarrays in the complement. The procedure then becomes difficult and
time consuming.
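The three overlap cases above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: a lifetime is modelled as a closed interval of node numbers, `ip` and `lp` stand for IP and LP (taken here to be the first and last points of the lifetime), and the names `Lifetime` and `resolve_overlap` are hypothetical, not taken from ALG_AP.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Lifetime:
    ip: int  # assumed meaning: first node of the lifetime, IP
    lp: int  # assumed meaning: last node of the lifetime, LP


def resolve_overlap(a: Lifetime, b: Lifetime) -> Tuple[Lifetime, Optional[Lifetime]]:
    """Adjust the lifetimes of (sub)array A and its subarray B.

    Returns the new lifetime of A and the new lifetime of B;
    B is None when B is deleted from the USE_LIST.
    """
    if b.lp < a.ip or b.ip > a.lp:
        return a, b                      # lifetimes are disjoint: nothing to do
    if a.ip <= b.ip and b.lp <= a.lp:
        return a, None                   # L(B) inside L(A): delete B
    if b.lp < a.lp:
        return a, Lifetime(b.ip, a.lp)   # B starts and ends early: extend B
    return Lifetime(a.ip, b.lp), b       # LP(B) >= LP(A): extend A


print(resolve_overlap(Lifetime(5, 10), Lifetime(2, 7)))
```

The sketch extends a lifetime rather than computing the complement of B with respect to A, which is the simplification argued for in the text.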

The next question to be discussed is how to relate (sub)arrays
to pages. The freeing and prepaging instructions available to us can free
and prepage only pages, but the information in USE_LIST is about
(sub)arrays. Somehow, we have to bridge the gap. The problem is
compounded by the fact that the subarrays can change their sizes during
the execution of the program. We will consider the cases separately.

If an array occurs in the USE_LIST then we can prepage all of its
pages a few nodes in advance. Similarly, at the expiration of its
lifetime, we can free all of its pages. We have assumed that the array
does not share any pages with other useful (sub)arrays.

If  we  consider  a  static  subarray  smaller  than  a  page,  then  while 
prepaging  it,  we  can  just  bring  in  the  page  in  which  the  subarray  is 
stored.  While  freeing,  however,  we  have  to  make  sure  that  this  page  does 
not  contain  any  useful  information.   For  the  sake  of  simplicity,  we  will 
not  free  such  a  subarray. 


If we consider a large static subarray then, while prepaging, we
fetch all the pages containing the elements of the subarray. While freeing,
we can free only those pages which contain just the elements of this
subarray. Other pages, which are shared with other subarrays, cannot
be freed.
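The rule for static subarrays can be illustrated with a small Python sketch. It assumes that each subarray is given as a collection of linear element addresses, that a page holds `page_size` consecutive addresses, and that every other occupant of the address space is listed in `others`; the function names are hypothetical.

```python
def pages_of(addresses, page_size):
    """Pages touched by a subarray: these are the pages to prepage."""
    return {a // page_size for a in addresses}


def freeable_pages(target, others, page_size):
    """Pages of `target` that hold no element of any other subarray."""
    shared = set()
    for other in others:
        shared |= pages_of(other, page_size)
    return pages_of(target, page_size) - shared


# A subarray spanning addresses 2..9 with 4 elements per page touches
# pages 0, 1 and 2, but only page 1 belongs to it alone.
print(freeable_pages(range(2, 10), [range(0, 2), range(10, 12)], 4))
```

A subarray smaller than a page that shares its page with a neighbour yields an empty set here, which matches the decision above not to free such a subarray.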

We now consider the case of dynamic subarrays. Consider a loop $\ell$
with the control variable $k$ such that for the given subarray $B$, $P(B, k) = T$
(i.e., $B$ is a dynamic subarray with respect to the loop $\ell$). Let us denote
the values that the control variable $k$ takes during the execution of
the loop by $k_0, k_1, \ldots, k_i, k_{i+1}, \ldots$. Let us denote the elements of
subarray $B$ during the $k$th loop traversal by $B(k)$ and the pages containing
$B(k)$ by $P(B(k))$. Then

$$P(B(k_0)) \in DWS(k_0) \quad \text{and} \quad \Delta P(B(k_{i+1})) = P(B(k_{i+1})) - P(B(k_i)) \in DWS(k_{i+1}) - DWS(k_i).$$

This means that the elements of $B(k_0)$ must be prepaged before the
first traversal of the loop and that during each subsequent traversal only
$\Delta P(B(k_{i+1}))$ need be prepaged (if not null). The first requirement is
already satisfied since, before the first traversal of the loop, we prepage
all the subarrays along the static chain in the loop.
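As a rough illustration of this differential prepaging, the following Python sketch computes what has to be brought in at each traversal. It assumes the page sets $P(B(k_0)), P(B(k_1)), \ldots$ are already known as Python sets; the function name is hypothetical.

```python
def differential_prepage(page_sets):
    """Return the pages to prepage before each loop traversal.

    page_sets[i] is P(B(k_i)). Entry 0 of the result is P(B(k_0)), which
    must be resident before the first traversal; entry i+1 is the
    difference P(B(k_{i+1})) - P(B(k_i)), which may be empty.
    """
    plan = [set(page_sets[0])]
    for previous, current in zip(page_sets, page_sets[1:]):
        plan.append(set(current) - set(previous))
    return plan


print(differential_prepage([{0, 1}, {1, 2}, {1, 2}, {2, 3}]))
```

An empty entry in the plan corresponds to a traversal in which the subarray changed without touching a new page, so no prepaging instruction is needed.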

For satisfying the second requirement, we must determine the
conditions under which $\Delta P(B(k_{i+1}))$ will be non-null. Before that,
however, we will determine the conditions under which
$\Delta B(k_{i+1}) = B(k_{i+1}) - B(k_i)$ will be non-null. We will assume that the
subarray $B$ has four boundary expressions associated with it. Let us
further assume that $B = A\langle I, J\rangle$, i.e., it is the $\langle I, J\rangle$ subarray of $A$.
Then $r_{I-1}$, $r_I$, $C_{J-1}$, $C_J$ are the boundary expressions associated with
it as shown in Figure 6.1.

[Figure 6.1. The Partition Lines: the subarray A<I,J>, bounded by the row
lines $r_{I-1}$ and $r_I$ and the column lines $C_{J-1}$ and $C_J$.]

In general, these four expressions are functions of $k$. Let
$X_1 = \{r_{I-1}(k), C_{J-1}(k)\}$, $X_2 = \{r_I(k), C_J(k)\}$ and $X = X_1 \cup X_2$. Define
$f_{i+1}(x)$ $\forall x \in X$ by:

$$f_{i+1}(x) = \begin{cases} 1 & \text{if } x(k_{i+1}) > x(k_i) \\ 0 & \text{if } x(k_{i+1}) = x(k_i) \\ -1 & \text{if } x(k_{i+1}) < x(k_i). \end{cases}$$

Then clearly, the condition for $\Delta B(k_{i+1})$ to be non-null is given by:

$$(\exists x \in X_1 \ni f_{i+1}(x) = -1) \lor (\exists x \in X_2 \ni f_{i+1}(x) = 1).$$

Intuitively, the condition asserts that the subarray $B$ is expanding in at
least one of the four directions. Let us assume that for a subset
$X_0$ of $X$, $\forall x \in X_0$, $B$ is expanding along the boundary expression $x$. We
would like to determine the conditions under which $\Delta P(B(k_{i+1})) \neq \emptyset$.
First of all, it is clear that

$$\Delta B(k_{i+1}) = \emptyset \Rightarrow \Delta P(B(k_{i+1})) = \emptyset.$$

Given that $\Delta B(k_{i+1}) \neq \emptyset$, it may be that $\Delta P(B(k_{i+1})) = \emptyset$. If
$\exists x \in X_0 \ni x(k_i) \equiv 0 \bmod m$, then $\Delta P(B(k_{i+1})) \neq \emptyset$. Intuitively, this
condition asserts that one of the expanding boundaries has just crossed
a page boundary. Thus the required condition for prepaging is:

$$(\Delta B(k_{i+1}) \neq \emptyset) \land (\exists x \in X_0 \ni x(k_i) \equiv 0 \bmod m).$$

As regards freeing dynamic subarrays, during the $k$th loop
traversal, we scan the dynamic chain and try to free all the dead subarrays.
If the subarray shares pages with other subarrays then we cannot free
these pages. Also, if the subarray is contracting in at least one of the
directions then we cannot free it, since the present elements of this
dead subarray may become the future elements of some useful subarray. The
condition under which a dead subarray $B$ is not contracting in any direction
is given by:

$$(\forall x \in X_1,\ f_{i+1}(x) \le 0) \land (\forall x \in X_2,\ f_{i+1}(x) \ge 0).$$

The condition under which the subarray $B$ does not share pages with any
other subarray is given by: $\forall x \in X$, $x(k_{i+1}) \equiv 0 \bmod m$. When both these
conditions are satisfied, we can free all the pages containing the dead
subarray $B$.
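The prepaging and freeing tests for a dynamic subarray can be sketched in Python as follows. This is a minimal illustration, not the code generated by ALG_AP. It assumes that the four boundary expressions have already been evaluated at $k_i$ and $k_{i+1}$ and stored in dictionaries; the keys `r_lo`, `c_lo`, `r_hi`, `c_hi` (for $r_{I-1}$, $C_{J-1}$, $r_I$, $C_J$) and all function names are hypothetical, and `m` is the page-boundary modulus used in the conditions above.

```python
X1 = ("r_lo", "c_lo")   # upper and left boundaries: r_{I-1}, C_{J-1}
X2 = ("r_hi", "c_hi")   # lower and right boundaries: r_I, C_J


def f(cur, nxt, x):
    """Sign of the change of boundary x between k_i and k_{i+1}."""
    delta = nxt[x] - cur[x]
    return (delta > 0) - (delta < 0)


def expanding(cur, nxt):
    """Boundaries along which B grows (the subset X_0)."""
    return ([x for x in X1 if f(cur, nxt, x) == -1]
            + [x for x in X2 if f(cur, nxt, x) == 1])


def should_prepage(cur, nxt, m):
    """True when an expanding boundary sits on a page boundary at k_i."""
    return any(cur[x] % m == 0 for x in expanding(cur, nxt))


def can_free(cur, nxt, m):
    """True when a dead B is not contracting and shares no page."""
    not_contracting = (all(f(cur, nxt, x) <= 0 for x in X1)
                       and all(f(cur, nxt, x) >= 0 for x in X2))
    page_aligned = all(nxt[x] % m == 0 for x in X1 + X2)
    return not_contracting and page_aligned


cur = {"r_lo": 0, "c_lo": 0, "r_hi": 8, "c_hi": 8}
nxt = dict(cur, r_hi=9)
print(should_prepage(cur, nxt, 4), can_free(cur, nxt, 4))
```

Since `expanding` returns an empty list exactly when $\Delta B(k_{i+1})$ is null, `should_prepage` covers both conjuncts of the prepaging condition.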

It may happen that some of the shared pages of a dead subarray belong
to another dead subarray. This sharing will unnecessarily prevent both
subarrays from being freed. We can easily incorporate a 'combination of
dead subarrays' function in our procedure. Assume that A<1,1>, A<1,2>, A<1,3>
occur as dead subarrays and that there are only two vertical partition
lines of A. Then the combination may be recognized as a whole (top) row of
subarrays and may be denoted by A<1,*>. With this kind of combination,
we may be able to free more frequently; in fact, such a situation does occur
in Gaussian elimination. We will not, however, discuss the details.

The scheme for freeing that we have discussed is not complete,
because there may exist a subarray which is dead but which always shares
pages with useful subarrays, and therefore is never freed. To alleviate
this difficulty, when the scope S(A) of an array A ends, we will free all
the subarrays of A remaining in MM. This, of course, implies an end-of-scope
node for each array A.

We give an algorithm (called ALG_AP) in Appendix D, which builds
the USE_LIST, builds the DEAD_LIST and attaches it to the USE_LIST, and
finally inserts instructions in the code for freeing and prepaging.

7.   CONCLUSION AND FUTURE RESEARCH

7.0   Conclusion 

The  improvement  of  paging  performance  of  programs  running  under  a 
paged  virtual  memory  system  has  been  demonstrated  in  this  thesis.  As 
we  have  seen,  there  are  many  variables  which  affect  the  performance.   The 
paging  algorithm  used  by  the  system  and  the  locality  of  the  program  under 
consideration  are  two  of  the  more  important  variables  of  the  system.  We 
have  studied  the  effects  of  these  two  variables  on  the  performance  of 
programs  running  under  such  a  system. 

In  analyzing  the  performance,  it  was  necessary  to  choose  a  set 
of  performance  measures.   The  traditional  performance  measure  is  the 
number  of  page  pulls  incurred  by  the  program,  but  the  number  of  page 
faults  and  the  space-time  product  of  main  memory  are  also  important 
performance  measures.  Which  of  these  performance  measures  is  the  most 
important  depends  on  a  particular  system  and  the  mix  of  programs  running 
under  it. 

Demand  paging  algorithms  have  been  very  popular,  both  in  theory 
and  in  practice.  We  have  shown,  both  theoretically  and  practically,  that 
non-demand  algorithms  are  useful  in  improving  paging  performance. 

We have defined a fixed-memory demand prepaging algorithm DPMIN
and proved that it is an optimal demand prepaging algorithm in the number
of page faults. This algorithm, however, is unrealizable since it requires
advance knowledge of the future reference string. It serves as a
benchmark of performance for realizable demand prepaging algorithms.

We have defined several realizable prepaging algorithms. We have
studied, for each of these algorithms, whether or not it is a stack algorithm.
To show the practical usefulness of these algorithms, we resorted to a
specific application. We have shown that these prepaging algorithms improve
the paging performance of common matrix algorithms considerably. The
reduction in the number of page faults is shown to be as high as an order of
magnitude. We have measured the number of page faults and the number of
page pulls for many different matrix algorithms using many different paging
algorithms.

The application of realizable prepaging algorithms requires some
knowledge of the future paging needs of the program. If the programmer
is willing to supply this knowledge, he is suitably rewarded. We have shown
that a compiler for a structured array-language like OL/2 can extract,
without too much overhead, enough information from a program to improve its
paging performance. The compiler can identify portions of the present
working set which cease to be in the future working set. The compiler can
also provide reliable advance information about the working set. As a
result, moving unwanted information out of memory and bringing in useful
information, in advance of its use, can be done efficiently.

We have defined a variable-memory prepaging algorithm PWS, based
on Denning's WS algorithm. We have shown that the PWS algorithm incurs zero
page faults, has the same number of page pulls as WS, and is more
economical in ST product than WS. However, PWS is an unrealizable algorithm
since it requires that the program's reference string be known in advance.
The PWS algorithm can serve as a benchmark of performance among all
variable-memory prepaging algorithms.

We  have  identified  three  methods  of  improving  the  locality  of 
matrix  algorithms.   These  are:   loop  reversal,  submatrix  multiplication 
and  the  construction  of  a  submatrix  algorithm.  We  have  shown  that  these 
methods  yield  a  vast  improvement  in  locality  and  that  the  most  powerful 
method  is  the  construction  of  a  submatrix  algorithm.   The  average  working 
set  size  is  known  to  be  a  good  measure  of  the  locality  of  a  program.  We 
have  shown  that  a  decrease  in  the  average  working  set  size,  generally, 
implies  a  reduction  in  the  number  of  page  faults. 

We have considered the problem of automating these methods to
improve locality. If the source language is a high-level array-language,
like OL/2, then the nature of the primitives of the language provides a
much easier means of optimization.

We have shown that the locality-improvement methods are effective
for lower values of the page allotment and that prepaging methods are
effective at higher values of the page allotment.

7.1   Future Research

As we have noted, very little is known about prepaging algorithms.
Both theoretical and practical investigations are needed for understanding
the behavior of prepaging algorithms. In the future, memory technology will
provide very large memories, making program (instruction) paging less
important. Parallel processing will provide the means for solving problems
with very large data bases, thereby making data paging more important.
As a result, we expect the emphasis in paging to shift toward specialized
systems where more information can be used to anticipate paging demands.

There are many matrix algorithms for which submatrix algorithms
do not yet exist. This is a useful area of investigation.

Actual implementation, either in software or hardware, of the
paging optimizer proposed in this thesis needs to be done. The use of the
paging optimizer in conjunction with variable-memory prepaging algorithms
can also be investigated.

A study of models of program behavior in conjunction with prepaging
algorithms is another fruitful area of investigation. In particular, we
refer to conjecture 3.1 regarding the optimality of the paging algorithm
DPOBLLRU for the sequential-random model of program behavior (refer to
section 3.2.2.1).

APPENDIX A


PL/I PROGRAMS FOR THE SIMULATION OF
SOME PAGING ALGORITHMS


A.1   LRU_WS_MIN


We will present LRU_WS_MIN in this section. There are three
modules: the first module declares all the global variables used by
LRU_WS_MIN, the second module is the procedure LRU_WS_MIN and the third
module computes the results. Note that LRU_WS_MIN is a stack algorithm
and therefore, in one pass of the reference string, obtains the page fault
behavior for all possible values of the page allotment or the window size,
whichever is pertinent to the algorithm.

MODULE1:/*DECLARE ALL GLOBAL VARIABLES */
DCL(NA              ,                /*ADDRESS SPACE SIZE    */
    LRUCNT(0:NA)    INIT((NA+1)(0)), /*LRU DIST.COUNTER      */
    WSCNT (0:NA)    INIT((NA+1)(0)), /*REF.INT.FREQ.COUNTER  */
    MINCNT(0:NA)    INIT((NA+1)(0)), /*MIN FREQ.COUNTER      */
    X(NA+1)         INIT((NA+1)(0)), /*LRUSTACK PAGE INDEX   */
    CURSTACK        INIT(0)        , /*LRUSTACK POINTER      */
    LNGTH           INIT(0)        , /*REF.STR. LENGTH       */
    C               ,                /*MEMORY SIZE           */
    NCLRU           ,                /*LRU SUCC.FUNCTION     */
    NTWS            ,                /*WS  SUCC.FUNCTION     */
    NCMIN           ,                /*MIN SUCC.FUNCTION     */
    PCLRU,PTWS,PCMIN,                /*THEIR PAGE FAULTS     */
    REFCNT(NA+1)    INIT((NA+1)(0)), /*REF.INTERVAL COUNTER  */
    P(NA+1)         INIT((NA+1)(0)), /*P-STACK               */
    P_POINT         INIT(0)        , /*P_STACK POINTER       */
    #) FIXED BIN(31,0);
DCL(MCLRU,MTWS,MCMIN,                /*PAGE FAULT FREQUENCIES*/
    STAU            ,                /*AVE.WS SIZE           */
    ##) FLOAT BIN;
MODULE2:/* PROCEDURE LRU_WS_MIN */
LRU_WS_MIN:PROC(XX);
   /*THIS PROC SIMULATES MATTSON-LRU-STACK-ALG. TOGETHER*/
   /*WITH GORDON-MODIFICATION TO GET WS-STATISTICS AND  */
   /*ONE PASS MIN CF.BELADY & PALERMO                   */
   DCL(XX  ,          /*CURRENTLY REFERENCED PAGE */
       DIST INIT(0),  /*STACK DISTANCE*/
       INT  INIT(0),  /*REF. INTERVAL */
       I,J,JP,MINP,MINPP) FIXED BIN(31,0);
LRU_WS: /*SEE R.L.GORDON */
   LNGTH=LNGTH+1;CURSTACK=CURSTACK+1;
   /*PUT THIS PAGE ON TOP OF STACK */
   X(CURSTACK)=XX;REFCNT(CURSTACK)=1;
   /*SEARCH FOR THIS PAGE IN THE STACK */
   DO I=CURSTACK-1 TO 1 BY -1;
      IF X(I)=XX THEN
      DO; /* FOUND */
         INT=REFCNT(I);DIST=CURSTACK-I;
         /*MOVE PART OF STACK ONE BELOW */
         DO J=I TO CURSTACK-1 BY 1;
            X(J)=X(J+1);REFCNT(J)=REFCNT(J+1);
         END;
      END;ELSE REFCNT(I)=REFCNT(I)+1;
   END;
   /* STEP LRU & WS COUNTERS */
   IF INT>NA THEN INT=NA;/*DONT TRUST WSCNT(NA)*/
   LRUCNT(DIST)=LRUCNT(DIST)+1;WSCNT(INT)=WSCNT(INT)+1;
   IF DIST=0 THEN DIST=CURSTACK-1;ELSE
   DO;CURSTACK=CURSTACK-1;DIST=DIST-1; END;
   /*NEXT IS THE CODE FOR BELADYS M-OPERATOR */
MIN:
   IF DIST=0 THEN RETURN;
   IF DIST>P_POINT THEN
   DO; /* A FRESH PAGE-REF. */
      P_POINT=P_POINT+1;P(P_POINT)=DIST+1;RETURN;
   END;
   /*FIND SMALLEST ELE BELOW DIST IN P-STACK */
   MINP=P(1);J=1;LSMALL=P_POINT-DIST+1;
   DO I=2 TO LSMALL BY 1;
      IF P(I) < MINP THEN
      DO;MINP=P(I);J=I; END;
   END;
   DO WHILE('1'B);
      IF J=LSMALL THEN
      DO;
         DO I=J TO P_POINT-1 BY 1;
            P(I)=P(I+1);
         END;
         P(P_POINT)=MINP;MINCNT(MINP)=MINCNT(MINP)+1;
         RETURN;
      END;
      MINPP=P(J+1);JP=J+1;
      DO I=J+2 TO LSMALL BY 1;
         IF P(I) < MINPP THEN
         DO;MINPP=P(I);JP=I;END;
      END;
      P(JP)=MINP;P(J)=MINPP;J=JP;
   END;
END LRU_WS_MIN;
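The one-pass property of a stack algorithm can be illustrated with a short Python sketch of the LRU part only. It is not a transcription of the PL/I procedure above (it omits the WS and MIN statistics), and its function names are hypothetical. Distance 0 is used, as in the listing, to count references to pages not yet on the stack.

```python
def lru_stack_distances(refs):
    """Count LRU stack distances in one pass over the reference string.

    Distance d >= 1 means the page was found at depth d of the stack;
    distance 0 means the page was not on the stack (a first reference).
    """
    stack = []    # most recently used page first
    counts = {}   # distance -> number of references at that distance
    for page in refs:
        if page in stack:
            distance = stack.index(page) + 1
            stack.remove(page)
        else:
            distance = 0
        counts[distance] = counts.get(distance, 0) + 1
        stack.insert(0, page)
    return counts


def faults_for_allotment(counts, c):
    """Page faults under LRU with c page frames, from the distance counts."""
    hits = sum(n for d, n in counts.items() if 1 <= d <= c)
    return sum(counts.values()) - hits


counts = lru_stack_distances([1, 2, 3, 1, 2, 3])
print([faults_for_allotment(counts, c) for c in (1, 2, 3)])
```

A reference at distance d is a hit for every allotment c >= d, so a single table of distance counts yields the fault count for all allotments.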




A.2   DPMIN


In this section, we will present DPMIN. There are three modules:
the first module declares all the global variables used by DPMIN, the second
module is the procedure DPMIN and the third module computes and prints the
required statistics. Note that DPMIN is not a stack algorithm and therefore
requires one pass of the reference string for each value of the page
allotment c.

MODULE1:/*DECLARE ALL GLOBAL VARIABLES */
DCL(NA            ,                 /*SIZE OF THE ADDRESS SPACE */
    DM_STK(2,NA)  INIT((2*NA)(0)),  /*TWO COL. OF NUM.MATRIX    */
    PREV          INIT(1)       ,   /*INDICATES LAST MEMSTATE   */
    CURR          INIT(2)       ,   /*INDICATES CURR MEMSTATE   */
    CURR_CNT      INIT(0)       ,   /*#PAGES IN MEMSTATE        */
    FULL          INIT(1)       ,
    EMPTY         INIT(0)       ,
    LNGTH         INIT(0)       ,
    C             ,                 /*#OF MEM.PAGES ALLOCATED   */
    FAULTS        INIT(0)       ,   /*#OF PAGE FAULTS           */
    PULLS         INIT(0)           /*#OF PAGE PULLS            */
    ) FIXED BIN(31,0);
MODULE2:/*PROCEDURE DPMIN */
DPMIN:PROC(X);
   /*THIS PROC.OBTAINS STAT. FOR DPMIN */
   DCL(X    ,        /*PAGE INDEX */
       I,ITEMP) FIXED BIN(31,0);
   LNGTH = LNGTH+1;
   IF CURR_CNT = C THEN
   DO; /*PRESENT MEMSTATE FULL */
      /*IF THIS PAGE IN MEM.THEN QUIT */
      IF DM_STK(CURR,X)=FULL THEN RETURN;
      ITEMP=PREV;PREV=CURR;CURR=ITEMP;
      DO I=1 TO NA;
         DM_STK(CURR,I)=EMPTY;
      END;
      DM_STK(CURR,X)=FULL;CURR_CNT=1;FAULTS=FAULTS+1;
      RETURN;
   END;
   IF DM_STK(CURR,X)=FULL THEN RETURN;
   IF DM_STK(PREV,X)=FULL THEN
   DO;DM_STK(CURR,X)=FULL;CURR_CNT=CURR_CNT+1;RETURN;END;
   DM_STK(CURR,X)=FULL;CURR_CNT=CURR_CNT+1;
   IF CURR_CNT=1 THEN FAULTS=FAULTS+1;
   ELSE PULLS =PULLS +1;
END DPMIN;
MODULE3:/*COMPUTATION OF RESULTS */
   PULLS = FAULTS+PULLS;
   PUT SKIP DATA(C,FAULTS,PULLS,LNGTH);

A.3   FREELRU

In this section, we will present FREELRU. This is a stack algorithm
and therefore it obtains the page fault statistics, for all possible values
of the page allotment c, in one pass of the reference string. There are
three modules: the first module declares all the global variables used by
FREELRU, the second module is the procedure FREELRU and the third module
computes the results.

MODULE1:/*DECLARE ALL GLOBAL VARIABLES */
DCL(NA              ,                /*SIZE OF ADDRESS SPACE  */
    DISTCNT(0:NA)   INIT((NA+1)(0)), /*LRU DIST.COUNTER       */
    X(NA+1)         INIT((NA+1)(0)), /*STACK PAGE INDEX       */
    PSTATE(NA+1)    INIT((NA+1)(0)), /*EMPTY-FULL INDICATOR   */
    EMPTY           INIT(0),
    FULL            INIT(1),
    NORMAL          INIT(0),         /*INDICATES A PAGE REF.  */
    DFREE           INIT(1),         /*INDICATES PAGE IS DEAD */
    CURSTACK        INIT(0),         /*POINTER TO STACK       */
    LNGTH           INIT(0),         /*REF. STRING LENGTH     */
    C               ,                /*MEM.PAGE ALLOTMENT     */
    NC              ,                /*SUCCESS FUNCTION       */
    FAULTS          INIT(0),         /*#OF PAGEFAULTS         */
    #) FIXED BIN(31,0);
DCL(FC              ,                /*SUCCESS FREQUENCY      */
    MC              ,                /*PAGEFAULT FREQ.        */
    ##) FLOAT BIN;
MODULE2:/*PROC FREELRU(STACK ALGORITHM) */
FREELRU:PROC(ACTION,XX);
   DCL(ACTION,/*INDICATES WHETHER A PAGE REF. OR A FREE COMM*/
       XX    ,/*CURRENTLY REFERENCED PAGE */
       I,J,K,PTEMP,XTEMP,TEMP,FLAG) FIXED BIN(31,0);
   IF ACTION=NORMAL THEN
   DO;/*A REFERENCE TO PAGE XX*/
      LNGTH=LNGTH+1;
      /*SEARCH FOR THIS PAGE IN STACK */
      FLAG=0;
      DO I=1 TO CURSTACK WHILE(FLAG=0);
         IF PSTATE(I)=FULL & X(I)=XX THEN
         DO;
            FLAG=1; /* TO TERMINATE SEARCH */
            PSTATE(I)=EMPTY;
            DISTCNT(I)=DISTCNT(I)+1;
         END;
      END;
      IF FLAG=0 THEN
         /* PAGE NOT FOUND IN CORE,DISTCNT(0) IS ABS. */
         /*PAGE FAULT COUNTER */
         DISTCNT(0)=DISTCNT(0)+1;
      /* PUT CURR PAGE ON STACK */
      PTEMP=PSTATE(1);XTEMP=X(1);PSTATE(1)=FULL;
      X(1)=XX;K=1;
      DO J=2 TO CURSTACK+1 WHILE(PTEMP=FULL);
         TEMP=PSTATE(J);PSTATE(J)=PTEMP;PTEMP=TEMP;
         TEMP=X(J);X(J)=XTEMP;XTEMP=TEMP;K=J;
      END;
      IF K>CURSTACK THEN CURSTACK=K;
   END;/*CODE FOR ACTION=NORMAL ENDS */
   IF ACTION=DFREE THEN
   /*DECLARE PAGE XX DEAD IF IN MEM. */
   DO;
      FLAG=0;
      DO I=1 TO CURSTACK WHILE(FLAG=0);
         IF PSTATE(I)=FULL & X(I)=XX THEN
         DO;FLAG=1;PSTATE(I)=EMPTY;K=I;END;
      END;
      IF FLAG=1 THEN IF K=CURSTACK THEN
         CURSTACK=CURSTACK-1;
END FREELRU;
MODULE3:/*RESULT COMPUTATION */
   NC=0;
   DO C=1 TO NA;
      NC=NC+DISTCNT(C);FC=NC/FLOAT(LNGTH);
      MC=1.0-FC;FAULTS=LNGTH-NC;
      PUT SKIP DATA(C,NC,FAULTS,FC,MC);
   END;

A.4   FREEDPRE4LRU

In this section, we will present FREEDPRE4LRU. This is not a stack
algorithm and therefore requires one pass of the reference string for each
value of the page allotment c. There are three modules: the first module
declares all the global variables, the second module is the procedure
FREEDPRE4LRU and the third module computes and prints out the required
statistics.

MODULE1:/*DECLARE ALL GLOBAL VARIABLES */
DCL(NA             ,               /*SIZE OF ADDRESS SPACE     */
    DP4_PT(NA)     INIT((NA)(0)),  /*PAGETABLE                 */
    DP4_STACK(NA)  INIT((NA)(0)),  /*LRU STACK                 */
    NORMAL         INIT(0)     ,   /*PAGE REF.                 */
    PREPAGE        INIT(1)     ,
    DFREE          INIT(2)     ,
    DP4_POINT      INIT(0)     ,   /*LRUSTACK POINTER          */
    FAULTS         INIT(0)     ,   /*#OF PAGE FAULTS           */
    PULLS          INIT(0)     ,   /*#OF PAGE PULLS            */
    PRECNT         INIT(0)     ,   /*#OF PREPAGED PAGES IN MEM */
    LNGTH          INIT(0)     ,   /*REF.STRING LNGTH          */
    C              ,               /*MEM.SIZE                  */
    #) FIXED BIN(31,0);
MODULE2:/*PROC FREEDPRE4LRU */
FREEDPRE4LRU:PROC(ACTION,X);
   /* THIS PROC.OBTAINS #PAGE FAULTS & #PAGE PULLS */
   /*FOR FREEDPRE4LRU PAGING ALGORITHM             */
   DCL(ACTION,         /*INDICATES WHETHER TO FREE, PREPAGE OR*/
                       /*A REFERENCE TO PAGE X                */
       X,              /*CURRENTLY REFERENCED PAGE            */
       NOTPRESENT      INIT(0),
       NOTSETUP        INIT(1),
       NOTUSED         INIT(2),
       USED            INIT(3),
       I,J) FIXED BIN(31,0);
   DCL(LACTION(0:2)    INIT(LNORMAL,LPREPAGE,LFREE),
       LPAGE  (0:3)    INIT(NOT_PRESENT,NOT_SETUP,
                            NOT_USED,LUSED)) LABEL;
   GOTO LACTION(ACTION);
   /*A REFERENCE TO PAGE X */
LNORMAL: LNGTH=LNGTH+1;
   GOTO LPAGE(DP4_PT(X));
NOT_SETUP:/*REQUIRED PAGE NOT YET SETUP IN MEM */
   FAULTS=FAULTS+1;/* INC. PAGEFAULT COUNT */
   /*PUT THIS PAGE ON STACK */
   DP4_POINT=DP4_POINT+1;
   DP4_STACK(DP4_POINT)=X;
   DP4_PT(X)=USED;PRECNT=PRECNT-1;
   /*DECLARE ALL NOTSETUP PAGES SETUP */
   DO I=1 TO NA;
      IF DP4_PT(I)=NOTSETUP THEN
      DO;DP4_PT(I)=NOTUSED;PULLS=PULLS+1;END;
   END;
   RETURN;
NOT_USED:/*REQ.PAGE SETUP */
   DP4_PT(X)=USED;PRECNT=PRECNT-1;
   DP4_POINT=DP4_POINT+1;DP4_STACK(DP4_POINT)=X;
   RETURN;
LUSED:/*PAGE X HAS BEEN USED BEFORE */
   DO I=DP4_POINT TO 1 BY -1;
      IF DP4_STACK(I)=X THEN
      DO;
         DO J=I TO DP4_POINT-1;
            DP4_STACK(J)=DP4_STACK(J+1);
         END;
         DP4_STACK(DP4_POINT)=X;RETURN;
      END;
   END;
   /*PAGE X IS NOT IN MEM */
NOT_PRESENT: FAULTS=FAULTS+1;
   IF DP4_POINT+PRECNT >=C THEN
   DO;
      DP4_PT(DP4_STACK(1))=NOTPRESENT;/*REPLACED PAGE */
      DO J=1 TO DP4_POINT-1;
         DP4_STACK(J)=DP4_STACK(J+1);
      END;
   END;
   ELSE DP4_POINT=DP4_POINT+1;
   DP4_STACK(DP4_POINT)=X;
   DP4_PT(X)=USED;
   /*DECLARE ALL NOTSETUP PAGES SETUP */
   DO I=1 TO NA;
      IF DP4_PT(I)=NOTSETUP THEN
      DO;DP4_PT(I)=NOTUSED;PULLS=PULLS+1;END;
   END;
   RETURN;
LPREPAGE:/*PREPAGE PAGE X */
   IF DP4_PT(X)=NOTPRESENT & DP4_POINT+PRECNT<C & PRECNT < C-1 THEN
   DO; DP4_PT(X)=NOTSETUP;PRECNT=PRECNT+1; END;
   RETURN;
LFREE:/*PAGE X IS KNOWN TO BE DEAD NOW */
   IF DP4_PT(X)=USED THEN
   DO I=DP4_POINT TO 1 BY -1;
      IF DP4_STACK(I)=X THEN
      DO;
         DO J=I TO DP4_POINT-1;
            DP4_STACK(J)=DP4_STACK(J+1);
         END;
         DP4_POINT=DP4_POINT-1;
      END;
   END;
   DP4_PT(X)=NOTPRESENT;
   RETURN;
END FREEDPRE4LRU;
MODULE3:/*COMPUTATION OF RESULT */
   PULLS = FAULTS+PULLS;
   PUT SKIP DATA(C,FAULTS,PULLS,LNGTH);

APPENDIX B


PROOF OF THEOREM 3.4


Theorem 3.4: PRE3A is a stack algorithm for $\{\omega \mid \omega \in N,\ P(\omega)\}$.

Proof: It is sufficient to prove that

$$S_t(c) \subseteq S_t(c+1) \quad \forall c \ge 1,\ \forall t \ge 0.$$

The proof is by induction on $t$. Assume inductively that,

(1) $U_t(c+1) = U_t(c) + y_1$ where $y_1 = \emptyset$ or $y_1 \in N$.

(2) $N_t(c+1) = N_t(c) + y_2$ where $y_2 = \emptyset$ or $y_2 \in N$.

(3) $S_t(c+1) = S_t(c) + y_3$ where $y_3 = \emptyset$ or $y_3 = y_1$ or $y_3 = y_2$.

(4) $COUNT(y, t, c) = COUNT(y, t, c+1)$, $\forall y \in N_t(c)$.

Clearly, all these assumptions are trivially true for $t = 0$. Now depending
on $r_{t+1}$, there are two cases to consider:

case (1): $r_{t+1} = x \in N$. Now depending on $x$, there are four cases to consider:

case (1a): $x \in N_t(c+1)$, which implies $x \in N_t(c)$ or $x \notin N_t(c)$.
Assume first that $x \in N_t(c)$. From step 1(a) of PRE3A,

$$\forall y \in N_t(c):\quad COUNT(y, t+1, c) = COUNT(y, t, c) + 1 = COUNT(y, t, c+1) + 1 = COUNT(y, t+1, c+1).$$

Let $N_t(c+1) = N_t(c) + y_2$. Then

$$\begin{aligned}
\{y \mid y \in N_t(c+1),\ COUNT(y, t+1, c+1) > T\}
&= \{y \mid y \in N_t(c),\ COUNT(y, t+1, c+1) > T\} + (\text{if } COUNT(y_2, t+1, c+1) > T \text{ then } y_2) \\
&= \{y \mid y \in N_t(c),\ COUNT(y, t+1, c) > T\} + (\text{if } COUNT(y_2, t+1, c+1) > T \text{ then } y_2).
\end{aligned}$$

Also let $U_t(c+1) = U_t(c) + y_1$, where $y_1$ and $y_2$ cannot both be nonzero.
Therefore, we have, from step 1(a) of PRE3A,

$$U_{t+1}(c+1) = U_{t+1}(c) + (y_1 \text{ or } y_2 \text{ or null})$$
$$N_{t+1}(c+1) = N_{t+1}(c) + (y_2 \text{ or null})$$

This proves all the inductive steps.

Next assume $x \notin N_t(c)$. Now if $N_t(c+1) = N_t(c) + y_2$ then $x = y_2$.
Also $y_1 = \emptyset$, and $S_t(c+1) = S_t(c) + x$. Therefore, for page allotment $c$,
we are in step 1(d) of PRE3A and for page allotment $c+1$, we are in step 1(a)
of PRE3A.

$$\begin{aligned}
U_{t+1}(c+1) &= U_t(c+1) + x \cup \{y \mid y \in N_t(c+1),\ COUNT(y, t+1, c+1) > T\} \\
&= U_t(c) + x \cup \{y \mid y \in N_t(c),\ COUNT(y, t+1, c+1) > T\}
\end{aligned}$$

Also since

$$\forall y \in N_t(c):\quad COUNT(y, t+1, c) = COUNT(y, t, c) + 1$$

and $COUNT(y, t+1, c+1) = COUNT(y, t, c+1) + 1$,
we have,

$$U_{t+1}(c+1) = U_t(c) + x \cup \{y \mid y \in N_t(c),\ COUNT(y, t+1, c) > T\}.$$

But from step 1(d) we have this

$$= U_{t+1}(c) + R_A(U_t(c), q, x).$$

Also,

$$S_{t+1}(c+1) = S_t(c+1) = S_t(c) + x = S_{t+1}(c) + R_A(U_t(c), q, x).$$

From these, the required results follow.
case (1b): $x \in U_t(c+1) \Rightarrow x \in U_t(c)$ or $x \notin U_t(c)$. Assume first that
$x \in U_t(c)$ and assume that $U_t(c+1) = U_t(c) + y_1$. From step 1(b) of PRE3A,

$$\begin{aligned}
U_{t+1}(c+1) &= U_t(c+1) \cup \{y \mid y \in N_t(c+1),\ COUNT(y, t+1, c+1) > T\} \\
&= U_t(c) + y_1 \cup \{y \mid y \in N_t(c),\ COUNT(y, t+1, c) > T\} + (\text{if } COUNT(y_2, t+1, c+1) > T \text{ then } y_2) \\
&= U_{t+1}(c) + (y_1 \text{ or } y_2 \text{ or null})
\end{aligned}$$

Also,

$$S_{t+1}(c+1) = S_t(c+1) = S_t(c) + (y_1 \text{ or } y_2)$$

and $S_{t+1}(c) = S_t(c)$

$$\therefore\ S_{t+1}(c+1) = S_{t+1}(c) + (y_1 \text{ or } y_2).$$

This proves all the required things.

Next assume $x \notin U_t(c) \Rightarrow U_t(c+1) = U_t(c) + x$,

$$N_t(c+1) = N_t(c), \quad S_t(c+1) = S_t(c) + x.$$

Therefore, for page allotment $c+1$, we are in step 1(b) and for page
allotment $c$, we are in step 1(d) of PRE3A.

$$\begin{aligned}
U_{t+1}(c+1) &= U_t(c+1) \cup \{y \mid y \in N_t(c+1),\ COUNT(y, t+1, c+1) > T\} \\
&= U_t(c) + x \cup \{y \mid y \in N_t(c),\ COUNT(y, t+1, c) > T\} \\
&= U_{t+1}(c) + R_A(U_t(c), q, x).
\end{aligned}$$

Also,

$$S_{t+1}(c+1) = S_t(c+1) = S_t(c) + x,$$
$$S_{t+1}(c) = S_t(c) + x - R_A(U_t(c), q, x).$$
$$\therefore\ S_{t+1}(c+1) = S_{t+1}(c) + R_A(U_t(c), q, x).$$

From these the required results follow.
case (1c): $x$ [condition illegible in source], then we are in step 1(c) of PRE3A. We have

$$S_{t+1}(c+1) = U_{t+1}(c+1) = S_t(c+1) + x - R_A(U_t(c+1), q, x)$$
$$S_{t+1}(c) = U_{t+1}(c) = S_t(c) + x - R_A(U_t(c), q, x).$$

Let $U_t(c+1) = U_t(c) + y_1$.
Now since A is a stack algorithm,

$$R_A(U_t(c+1), q, x) = R_A(U_t(c), q, x) \text{ or } y_1.$$

Therefore,

$$\begin{aligned}
S_{t+1}(c+1) &= S_t(c+1) + x - (R_A(U_t(c), q, x) \text{ or } y_1) \\
&= S_t(c) + y_1 + x - (R_A(U_t(c), q, x) \text{ or } y_1).
\end{aligned}$$

Therefore,

$$S_{t+1}(c+1) = S_{t+1}(c) + (R_A(U_t(c), q, x) \text{ or } y_1)$$

Also, since

$$N_{t+1}(c+1) = N_{t+1}(c) = \emptyset$$

the required results follow.

case (1d): $x \notin S_t(c+1) \Rightarrow x \notin S_t(c)$. Therefore, we are in step 1(d) of
PRE3A.

$$\begin{aligned}
U_{t+1}(c+1) &= U_t(c+1) + x - R_A(U_t(c+1), q, x) \cup \{y \mid y \in N_t(c+1),\ COUNT(y, t+1, c+1) > T\} \\
&= U_t(c) + y_1 + x - (R_A(U_t(c), q, x) \text{ or } y_1) \cup \{y \mid y \in N_t(c),\ COUNT(y, t+1, c) > T\} \\
&\quad + (\text{if } COUNT(y_2, t+1, c+1) > T \text{ then } y_2).
\end{aligned}$$

Also,

$$U_{t+1}(c) = U_t(c) + x - R_A(U_t(c), q, x) \cup \{y \mid y \in N_t(c),\ COUNT(y, t+1, c) > T\}.$$

Since $y_1$ and $y_2$ cannot both be nonzero, we have,

$$U_{t+1}(c+1) = U_{t+1}(c) + (y_1 \text{ or } y_2 \text{ or } R_A(U_t(c), q, x))$$

Also,

$$\begin{aligned}
S_{t+1}(c+1) &= S_t(c+1) + x - R_A(U_t(c+1), q, x) \\
&= S_t(c) + y_3 + x - (R_A(U_t(c), q, x) \text{ or } y_1) \\
&= S_{t+1}(c) + (y_1 \text{ or } y_2 \text{ or } R_A(U_t(c), q, x)).
\end{aligned}$$

From these, the required result follows.

case  (2);    r^+i  -  pre(x) 

Since  OJ   satisfies  the  property  P,  we  have,  x  /   |"?  (or 

x  /  S,  (c)  for  any  c.  ).   Assume  S, (c+l)  =  S,  (c)  +  y,  and  y_  is  not  null, 
w  t         t       j  5 

(i) If |S_t(c+1)| < c+1, then |S_t(c)| < c.

Then,  N_{t+1}(c+1) = N_t(c+1) + x = N_t(c) + y_2 + x
and    N_{t+1}(c)   = N_t(c) + x.
Also, since y_2 = y_3,
       N_{t+1}(c+1) = N_{t+1}(c) + y_3,
       S_{t+1}(c+1) = S_t(c+1) + x
                    = S_t(c) + y_3 + x
and    S_{t+1}(c)   = S_t(c) + x.

Therefore, S_{t+1}(c+1) = S_{t+1}(c) + y_3.

This proves the required result.

(ii) If |S_t(c+1)| = c+1, then |S_t(c)| = c.

Then, S_{t+1}(c) = S_t(c), and S_{t+1}(c+1) = S_t(c+1), etc.
Therefore, all the required results follow.

Now assume that y_3 is null, i.e. S_t(c+1) = S_t(c).

(i) |S_t(c)| < c, which implies |S_t(c+1)| < c+1; then the
required results are trivial to prove.

(ii) |S_t(c)| = c, which implies |S_t(c+1)| = c < c+1.
Therefore,

       S_{t+1}(c+1) = S_t(c+1) + x = S_t(c) + x
and    S_{t+1}(c)   = S_t(c), which implies S_{t+1}(c+1) = S_{t+1}(c) + x.
Also   N_{t+1}(c+1) = N_t(c+1) + x = N_t(c) + x
and    N_{t+1}(c)   = N_t(c), which implies N_{t+1}(c+1) = N_{t+1}(c) + x.
This proves the required result.


Since we have completed all possible cases of r_{t+1}, the proof is
complete.


APPENDIX C


PL/I PROGRAMS FOR SEVERAL MATRIX ALGORITHMS


C.1 Cholesky Decomposition

CDSRFP:PROC(A,NSMALL,MSMALL);

/*THIS PROC FACTORS THE SYM.POS.DEF.MATRIX*/
/*A (OF ORDER NSMALL) INTO THE PROD. OF A */
/*LOWER TRIANG.& AN UPP.TRI.MAT. */

/*IT IS A SUBMATRIX ALG.WITH LOOP REVERSAL*/
/*& FREEING & PREPAGING */

N=NSMALL/MSMALL;

DO I=1 TO N;
  /* FREE A<1,1> */
  DO J=1 TO I-1;
    CALL FREE(A,I-1,J);
  END;

  /* PREPAGE SUBARRAYS C & D DIFFERENTIALLY */
  DO K=I TO N;
    CALL PREPAGE(A,K,I);
  END;

  /* CARRY OUT D=D-R*R' */
  DO J=1 TO I-1;
    DO I1=1 TO MSMALL;
      DO J1=1 TO I1;
        S=0;
        DO K1=1 TO MSMALL;
          S=S+A((I-1)*MSMALL+I1,(J-1)*MSMALL+K1)
             *A((I-1)*MSMALL+J1,(J-1)*MSMALL+K1);
        END;
        A((I-1)*MSMALL+I1,(I-1)*MSMALL+J1)=
        A((I-1)*MSMALL+I1,(I-1)*MSMALL+J1)-S;
      END;
    END;
  END;

  /*DO CHOLDEC OF THE DIAG.SUBMATRIX D */
  CALL CD(A((I-1)*MSMALL+1:I*MSMALL,(I-1)*MSMALL+1:I*MSMALL),
          MSMALL);
  /*INVERT SLT PART OF D & STORE INTO SUT PART OF D*/
  DO K=2 TO MSMALL;
    S1=A((I-1)*MSMALL+K,(I-1)*MSMALL+K);
    DO I1=1 TO K-1;
      S=A((I-1)*MSMALL+K,(I-1)*MSMALL+I1)/
        A((I-1)*MSMALL+I1,(I-1)*MSMALL+I1);
      DO J=I1+1 TO K-1;
        S=S+A((I-1)*MSMALL+K,(I-1)*MSMALL+J)*
            A((I-1)*MSMALL+I1,(I-1)*MSMALL+J);
      END;
      A((I-1)*MSMALL+I1,(I-1)*MSMALL+K)=S/S1;
    END;
  END;

  /*COMPUTE C=C-M*R' */

  /*REVERSE FOLLOWING LOOP WITH RE. TO I */
  IF MOD(I,2)=1 THEN
  DO;JLL=I+1;JHL=N;JSTEP=1;END;ELSE
  DO;JLL=N;JHL=I+1;JSTEP=-1;END;

  DO J=JLL TO JHL BY JSTEP;
    /*REVERSE FOLLOW.LOOP WITH RES. TO J */
    IF MOD(J,2)=1 THEN
    DO;KLL=1;KHL=I-1;KSTEP=1;END;ELSE
    DO;KLL=I-1;KHL=1;KSTEP=-1;END;
    DO K=KLL TO KHL BY KSTEP;
      DO J1=1 TO MSMALL;
        DO I1=1 TO MSMALL;
          S=0;
          DO K1=1 TO MSMALL;
            S=S+A((J-1)*MSMALL+J1,(K-1)*MSMALL+K1)*
                A((I-1)*MSMALL+I1,(K-1)*MSMALL+K1);
          END;
          A((J-1)*MSMALL+J1,(I-1)*MSMALL+I1)=
          A((J-1)*MSMALL+J1,(I-1)*MSMALL+I1)-S;
        END;
      END;
    END;
  END;

  /* MULT G<J,I> BY G<I,I> INVERSE */
  DO J=I+1 TO N;
    DCL S(MSMALL) FLOAT; /*TEMP. VECTOR */
    DO J1=1 TO MSMALL;
      DO I1=1 TO MSMALL;
        S(I1)=A((J-1)*MSMALL+J1,(I-1)*MSMALL+I1);
      END;
      DO I1=1 TO MSMALL;
        S=S(I1)/A((I-1)*MSMALL+I1,(I-1)*MSMALL+I1);
        DO K1=1 TO I1-1;
          S=S+S(K1)*A((I-1)*MSMALL+K1,(I-1)*MSMALL+I1);
        END;
        A((J-1)*MSMALL+J1,(I-1)*MSMALL+I1)=S;
      END;
    END;
  END;
END CDSRFP;
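
The loop-reversal idiom that recurs in these listings (IF MOD(I,2)=1 THEN ... JSTEP=1 ... ELSE ... JSTEP=-1) can be illustrated with a minimal Python sketch. It is not part of the original programs; the names `reversed_bounds` and `sweep` are hypothetical. The sketch only shows that successive sweeps run in opposite directions, so the index visited last in one sweep is visited first in the next.

```python
def reversed_bounds(outer, low, high):
    """Return (start, stop, step) for the inner loop, mirroring
    IF MOD(outer,2)=1 THEN low..high BY 1 ELSE high..low BY -1."""
    if outer % 2 == 1:
        return low, high, 1
    return high, low, -1


def sweep(outer, low, high):
    """List the inner indices visited for one outer iteration (bounds inclusive)."""
    start, stop, step = reversed_bounds(outer, low, high)
    return list(range(start, stop + step, step))


if __name__ == "__main__":
    for i in (1, 2, 3):
        print(i, sweep(i, 1, 4))
```

When the range is empty (for example low = 1, high = 0), both directions yield no iterations, which matches the behaviour of a PL/I DO loop with such bounds.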
C.2 LU Decomposition

LUSRFP:PROC(A,NSMALL,MSMALL);

/*COMPUTES LU DECOMPOSITION OF MATRIX A OF ORDER*/
/*NSMALL & STORES L & U IN A;EMPLOYS A SUBMATRIX*/
/*ALGORITHM WITH LOOP REV.,FREEING & PREPAGING */
N=NSMALL/MSMALL;
DO K=1 TO N;
  /* FREE SUBARRAY A<1,1> */
  DO I=1 TO K-1;
    CALL FREE(A,I,K-1);CALL FREE(A,K-1,I);
  END;

  /* PREPAGE SUBARRAYS P & Z */
  DO I=K TO N;
    CALL PREPAGE(A,I,K);
  END;
  DO J=K+1 TO N;
    CALL PREPAGE(A,K,J);
  END;

  /* FORM D=D-R*X */
  DO KK=1 TO K-1;
    DO I1=1 TO MSMALL;
      DO J1=1 TO MSMALL;
        S=0;
        DO KK1=1 TO MSMALL;
          S=S+A((K-1)*MSMALL+I1,(KK-1)*MSMALL+KK1)
             *A((KK-1)*MSMALL+KK1,(K-1)*MSMALL+J1);
        END;
        A((K-1)*MSMALL+I1,(K-1)*MSMALL+J1)=
        A((K-1)*MSMALL+I1,(K-1)*MSMALL+J1)-S;
      END;
    END;
  END;

  /* DO LU DECOMP OF D */
  CALL LU(A((K-1)*MSMALL+1:K*MSMALL,(K-1)*MSMALL+1:
          K*MSMALL),MSMALL);
  /*COMP. INV(LD), INV(UD) & STORE IN PAGE <N+1,1> */
  DO K1=1 TO MSMALL;
    S1,A(NSMALL+K1,K1)=1/A((K-1)*MSMALL+K1,
                           (K-1)*MSMALL+K1);
    DO J=1 TO K1-1;
      S=0;
      DO I=J TO K1-1;
        S=S+A((K-1)*MSMALL+K1,(K-1)*MSMALL+I)*
            A(NSMALL+I,J);
      END;
      A(NSMALL+K1,J)=-S*S1;
    END;

    DO J=K1+1 TO MSMALL;
      S=A((K-1)*MSMALL+K1,(K-1)*MSMALL+J);
      DO I=K1+1 TO J-1;
        S=S+A((K-1)*MSMALL+K1,(K-1)*MSMALL+I)*
            A(NSMALL+I,J);
      END;
      A(NSMALL+K1,J)=-S;
    END;
  END;

  /* COMP. P=P-Q*X */
  /*REVERSE THE FOLLW.LOOP WITH RE. TO K */
  IF MOD(K,2)=1 THEN
  DO;ILL=K+1;IHL=N;ISTEP=1;END;ELSE
  DO;ILL=N;IHL=K+1;ISTEP=-1;END;
  DO I=ILL TO IHL BY ISTEP;
    /*REV. THE FOLLOW.LOOP WITH RE. I */
    IF MOD(I,2)=1 THEN
    DO;KKL=1;KKH=K-1;KKSTEP=1;END;ELSE
    DO;KKL=K-1;KKH=1;KKSTEP=-1;END;
    DO KK=KKL TO KKH BY KKSTEP;
      DO I1=1 TO MSMALL;
        DO J1=1 TO MSMALL;
          S=0;
          DO KK1=1 TO MSMALL;
            S=S+A((I-1)*MSMALL+I1,(KK-1)*MSMALL+KK1)*
                A((KK-1)*MSMALL+KK1,(K-1)*MSMALL+J1);
          END;
          A((I-1)*MSMALL+I1,(K-1)*MSMALL+J1)=
          A((I-1)*MSMALL+I1,(K-1)*MSMALL+J1)-S;
        END;
      END;
    END;
  END;
  /* COMP. P=P*(INV(UD)+I) */
  DCL S(MSMALL) FLOAT /*TEMP. STORAGE VECTOR */ ;
  DO I=K*MSMALL+1 TO NSMALL;
    DO J=1 TO MSMALL;
      S(J)=A(I,(K-1)*MSMALL+J);
    END;
    DO J=1 TO MSMALL;
      S=0;
      DO K1=1 TO J-1;
        S=S+S(K1)*A(NSMALL+K1,J);
      END;
      A(I,(K-1)*MSMALL+J)=A(I,(K-1)*MSMALL+J)+S;
    END;
  END;

  /*COMP. Z=Z-R*Y */
  /*REVERSE THE FOLLOW.LOOP WITH RE. TO K*/
  IF MOD(K,2)=1 THEN
  DO;JLL=K+1;JHL=N;JSTEP=1;END;ELSE
  DO;JLL=N;JHL=K+1;JSTEP=-1;END;
  DO J=JLL TO JHL BY JSTEP;
    /*REV. THE FOLLOW.LOOP WITH RE. TO J */

    IF MOD(J,2)=1 THEN
    DO;KKL=1;KKH=K-1;KKSTEP=1;END;ELSE
    DO;KKL=K-1;KKH=1;KKSTEP=-1;END;
    DO KK=KKL TO KKH BY KKSTEP;
      DO I1=1 TO MSMALL;
        DO J1=1 TO MSMALL;
          S=0;
          DO KK1=1 TO MSMALL;
            S=S+A((K-1)*MSMALL+I1,(KK-1)*MSMALL+KK1)*
                A((KK-1)*MSMALL+KK1,(J-1)*MSMALL+J1);
          END;
          A((K-1)*MSMALL+I1,(J-1)*MSMALL+J1)=
          A((K-1)*MSMALL+I1,(J-1)*MSMALL+J1)-S;
        END;
      END;
    END;
  END;
  /* COMPUTE Z=INV(LD)*Z */
  DO J=K*MSMALL+1 TO NSMALL;
    DO K1=1 TO MSMALL;
      S(K1)=A((K-1)*MSMALL+K1,J);
    END;
    DO I1=1 TO MSMALL;
      S=0;
      DO K1=1 TO I1;
        S=S+A(NSMALL+I1,K1)*S(K1);
      END;
      A((K-1)*MSMALL+I1,J)=S;
    END;
  END;
END LUSRFP;
C.3 Gaussian Elimination

GOSSRFP:PROC(A,NSMALL,MSMALL);

/*THIS PROC DOES GAUSSIAN ELIM.ON MATRIX A(OF ORDER */
/*NSMALL)USING SUBMATRIX ALG.WITH LOOP REVERSAL & */
/*ALSO USING FREEING & PREPAGING */

N=NSMALL/MSMALL;

/* PRELOAD AS MANY PAGES OF A AS POSSIBLE */
/*DESIRABLE ONLY WHEN PAGE ALLOT.>2*N */

DO J=1 TO N;
  DO I=1 TO N;
    CALL PREPAGE(A,I,J);
END;END;
DO K=1 TO N;

  /*MAJOR LOOP OF THE ALGORITHM */

  /* FREE THE DEAD SUBARRAYS */
  DO I=1 TO K-1;
    CALL FREE(A,I,K-1);
    CALL FREE(A,K-1,I);
  END;

  /* DO GAUSSELIM OF D */
  CALL GOS(A((K-1)*MSMALL+1:K*MSMALL,(K-1)*
           MSMALL+1:K*MSMALL),MSMALL);
  /*COMPLETE ELIM. ON SUBARRAYS R & C */
  DO J=1 TO MSMALL;
    /*COMPUTE C<2>=C<2>/D<2,2> */

    /*REVERSE LOOP WITH RE. TO J */

    IF MOD(J,2)=1 THEN
    DO;ILL=MSMALL*K+1;IHL=NSMALL;ISTEP=1;END;ELSE
    DO;ILL=NSMALL;IHL=K*MSMALL+1;ISTEP=-1;END;
    DO I=ILL TO IHL BY ISTEP;
      A(I,(K-1)*MSMALL+J)=A(I,(K-1)*MSMALL+J)/
        A((K-1)*MSMALL+J,(K-1)*MSMALL+J);
    END;

    /* C<3>=C<3>-C<2>*D<1,3> */
    /*REVERSE THE FOLLOW.LOOP OPP. TO PREV.LOOP*/
    IF MOD(J,2)=0 THEN
    DO;ILL=K+1;IHL=N;ISTEP=1;END;ELSE
    DO;ILL=N;IHL=K+1;ISTEP=-1;END;
    DO I=ILL TO IHL BY ISTEP;
      DO J1=J+1 TO MSMALL;
        DO I1=1 TO MSMALL;
          A((I-1)*MSMALL+I1,(K-1)*MSMALL+J1)=A((I-1)*
          MSMALL+I1,(K-1)*MSMALL+J1)-A((I-1)*MSMALL+I1,
          (K-1)*MSMALL+J)*A((K-1)*MSMALL+J,(K-1)*MSMALL
          +J1);
        END;
      END;
    END;
    /* R<3>=R<3>-D<3,2>*R<2> */

    /* REV WITH RE. TO J */

    IF MOD(J,2)=1 THEN
    DO;JJL=K+1;JJH=N;JJSTEP=1;END;ELSE
    DO;JJL=N;JJH=K+1;JJSTEP=-1;END;
    DO JJ=JJL TO JJH BY JJSTEP;
      DO I1=J+1 TO MSMALL;
        DO JJ1=1 TO MSMALL;
          A((K-1)*MSMALL+I1,(JJ-1)*MSMALL+JJ1)=
          A((K-1)*MSMALL+I1,(JJ-1)*MSMALL+JJ1)-
          A((K-1)*MSMALL+I1,(K-1)*MSMALL+J)*
          A((K-1)*MSMALL+J,(JJ-1)*MSMALL+JJ1);
        END;
      END;
    END;
  END;

  /* M=M-C*R */

  /*REV WITH RE. TO K */

  IF MOD(K,2)=1 THEN
  DO;ILL=K+1;IHL=N;ISTEP=1;END;ELSE
  DO;ILL=N;IHL=K+1;ISTEP=-1;END;
  DO I=ILL TO IHL BY ISTEP;

    /*REV WITH RE. TO I */

    IF MOD(I,2)=1 THEN
    DO;JLL=K+1;JHL=N;JSTEP=1;END;ELSE
    DO;JLL=N;JHL=K+1;JSTEP=-1;END;

    DO J=JLL TO JHL BY JSTEP;
      DO I1=1 TO MSMALL;
        DO J1=1 TO MSMALL;
          S=0;
          DO K1=1 TO MSMALL;
            S=S+A((I-1)*MSMALL+I1,(K-1)*MSMALL+K1)*
                A((K-1)*MSMALL+K1,(J-1)*MSMALL+J1);
          END;
          A((I-1)*MSMALL+I1,(J-1)*MSMALL+J1)=
          A((I-1)*MSMALL+I1,(J-1)*MSMALL+J1)-S;
END GOSSRFP;
C.4 Gram-Schmidt Orthogonalization

ORTHSRFP:PROC(A,R,NSMALL1,NSMALL2,MSMALL);
/*THIS PROC FACTORS MATRIX A(OF SIZE NSMALL1 X NSMALL2*/
/*) INTO V * R, V HAS ORTHONORMAL COLUMNS & R IS UPPER*/
/* TRIANGULAR */

N1=NSMALL1/MSMALL;N2=NSMALL2/MSMALL;

/*PREPAGE ARRAY A,ADVISABLE IF PAGE ALLOT.>2*N1*/

DO J=1 TO N2;
  DO I=1 TO N1;
    CALL PREPAGE(A,I,J);
END;END;

DO K=1 TO N2;
  /* FREE */
  DO I=1 TO N1 WHILE(K>1);
    CALL FREE(A,I,K-1);
  END;
  DO J=K-1 TO N2 WHILE(K>1);
    CALL FREE(R,K-1,J);
  END;

  /* PREPAGE */
  DO J=K TO N2;
    CALL PREPAGE(R,K,J);
  END;

  /* ORTHO. OF M VECTORS A<*,J> FOR K*M>=J>(K-1)*M */
  DO K1=1 TO MSMALL;
    /* COMP.(K-1)*M+K1 TH. COL. OF R */
    /*REV.LOOP WITH RE.TO K1 */
    IF MOD(K1,2)=1 THEN
    DO;ILL=1;IHL=N1;ISTEP=1;END;ELSE
    DO;ILL=N1;IHL=1;ISTEP=-1;END;
    DO I=ILL TO IHL BY ISTEP;
      /*REV.LOOP WITH RE. TO I */
      IF MOD(I,2)=1 THEN
      DO;JLL=1;JHL=K1-1;JSTEP=1;END;ELSE
      DO;JLL=K1-1;JHL=1;JSTEP=-1;END;
      DO J=JLL TO JHL BY JSTEP;
        S=0;
        DO I1=1 TO MSMALL;
          S=S+A((I-1)*MSMALL+I1,(K-1)*MSMALL+J)*
              A((I-1)*MSMALL+I1,(K-1)*MSMALL+K1);
        END;
        R((K-1)*MSMALL+J,(K-1)*MSMALL+K1)=
        R((K-1)*MSMALL+J,(K-1)*MSMALL+K1)+S;
      END;
    END;

    /*COMP. (K-1)*M+K1 TH. COL OF A */
    /*REVERSE WITH RE. TO K1 */
    DO I=IHL TO ILL BY -ISTEP;
      DO I1=1 TO MSMALL;
        S=0;
        /*REVERSE WITH RE. TO I1 */
        IF MOD(I1,2)=1 THEN
        DO;J1LL=1;J1HL=K1-1;J1STEP=1;END;ELSE
        DO;J1LL=K1-1;J1HL=1;J1STEP=-1;END;
        DO J1=J1LL TO J1HL BY J1STEP;
          S=S+A((I-1)*MSMALL+I1,(K-1)*MSMALL+J1)*
              R((K-1)*MSMALL+J1,(K-1)*MSMALL+K1);
        END;
        A((I-1)*MSMALL+I1,(K-1)*MSMALL+K1)=
        A((I-1)*MSMALL+I1,(K-1)*MSMALL+K1)-S;
      END;
    END;
    /*NORMALIZE JUST COMPUTED COL OF A */
    S=0;
    DO I=1 TO NSMALL1;
      S=S+A(I,(K-1)*MSMALL+K1)*A(I,(K-1)*MSMALL+K1);
    END;
    S=SQRT(S);R((K-1)*MSMALL+K1,(K-1)*MSMALL+K1)=S;
    /*REVERSE LOOP PERMANENTLY */
    DO I=NSMALL1 TO 1 BY -1;
      A(I,(K-1)*MSMALL+K1)=A(I,(K-1)*MSMALL+K1)/S;
    END;
  END;

  /*COMPUTE R<2,3>=A<2>' * A<3> */

  /*REV LOOP WITH RE. TO K */
  IF MOD(K,2)=1 THEN
  DO;ILL=1;IHL=N1;ISTEP=1;END;ELSE
  DO;ILL=N1;IHL=1;ISTEP=-1;END;
  DO I=ILL TO IHL BY ISTEP;

    /*REV LOOP WITH RE. TO I */

    IF MOD(I,2)=1 THEN
    DO;KKL=K+1;KKH=N2;KKSTEP=1;END;ELSE
    DO;KKL=N2;KKH=K+1;KKSTEP=-1;END;
    DO KK=KKL TO KKH BY KKSTEP;
      DO KK1=1 TO MSMALL;
        DO J1=1 TO MSMALL;
          S=0;
          DO I1=1 TO MSMALL;
            S=S+A((I-1)*MSMALL+I1,(K-1)*MSMALL+J1)*
                A((I-1)*MSMALL+I1,(KK-1)*MSMALL+KK1);
          END;
          R((K-1)*MSMALL+J1,(KK-1)*MSMALL+KK1)=S+
          R((K-1)*MSMALL+J1,(KK-1)*MSMALL+KK1);
        END;
      END;
    END;
  END;

  /*COMPUTE A<3>=A<3> - A<2> * R<2,3> */
  /*REV IN A DIRECTION OPP. TO PREV. LOOP */
  DO I=IHL TO ILL BY -ISTEP;
    /*REV WITH RE. TO I */
    IF MOD(I,2)=1 THEN
    DO;JLL=K+1;JHL=N2;JSTEP=1;END;ELSE
    DO;JLL=N2;JHL=K+1;JSTEP=-1;END;
    DO J=JLL TO JHL BY JSTEP;
      DO I1=1 TO MSMALL;
        DO J1=1 TO MSMALL;
          S=0;
          DO K1=1 TO MSMALL;
            S=S+A((I-1)*MSMALL+I1,(K-1)*MSMALL+K1)*
                R((K-1)*MSMALL+K1,(J-1)*MSMALL+J1);
          END;
          A((I-1)*MSMALL+I1,(J-1)*MSMALL+J1)=
          A((I-1)*MSMALL+I1,(J-1)*MSMALL+J1)-S;
END ORTHSRFP;
C.5 Matrix Multiplication

MMSRFP:PROC(A,B,C,NSMALL,MSMALL);

/* THIS PROC MULTIPLIES MATRICES B & C & STORES */
/* INTO A, BY SUBMATRICES;USES REVERSAL,FREEING */
/* & PREPAGING */

N=NSMALL/MSMALL;
/*PREPAGE AS MUCH OF ARRAY C AS POSSIBLE */
/*DESIRABLE IF PAGE ALLOT. > 2*N */

DO J=1 TO N;
  DO I=1 TO N;
    CALL PREPAGE(C,I,J);
END;END;

/*MAIN LOOP OF ALG. */
DO I=1 TO N;
  /*FREE DEAD PORTIONS OF A & B */
  DO I1=1 TO N WHILE(I>1);
    CALL FREE(A,I-1,I1);CALL FREE(B,I-1,I1);
  END;

  /*PREPAGE PORTIONS OF A & B */
  DO I1=1 TO N;
    CALL PREPAGE(B,I,I1);
    CALL PREPAGE(A,I,I1);
  END;

  /*REV LOOP WITH RE. TO I */
  IF MOD(I,2)=1 THEN
  DO;JLL=1;JHL=N;JSTEP=1;END;ELSE
  DO;JLL=N;JHL=1;JSTEP=-1;END;
  DO J=JLL TO JHL BY JSTEP;
    /*REV LOOP WITH RE TO J */
    IF MOD(J,2)=1 THEN
    DO;KLL=1;KHL=N;KSTEP=1;END;ELSE
    DO;KLL=N;KHL=1;KSTEP=-1;END;

    DO K=KLL TO KHL BY KSTEP;
      DO I1=1 TO MSMALL;
        DO J1=1 TO MSMALL;
          S=0;
          DO K1=1 TO MSMALL;
            S=S+B((I-1)*MSMALL+I1,(K-1)*MSMALL+K1)*
                C((K-1)*MSMALL+K1,(J-1)*MSMALL+J1);
          END;
          A((I-1)*MSMALL+I1,(J-1)*MSMALL+J1)=S+
          A((I-1)*MSMALL+I1,(J-1)*MSMALL+J1);
END MMSRFP;
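
For readers without a PL/I compiler, the block structure of MMSRFP can be approximated by the following Python sketch. It is an illustration only: the FREE and PREPAGE calls are omitted because they depend on the paging system, and the function name `blocked_matmul` and the `trace` list are hypothetical additions used to show the order in which submatrix triples (I, J, K) are visited.

```python
def blocked_matmul(B, C, m):
    """Multiply square matrices B and C by m-by-m submatrices, reversing
    the J and K block loops on alternate passes.
    Returns (A, trace), where trace lists the (I, J, K) block triples visited."""
    nsmall = len(B)
    assert nsmall % m == 0, "matrix order must be a multiple of the block size"
    n = nsmall // m
    A = [[0.0] * nsmall for _ in range(nsmall)]
    trace = []
    for I in range(1, n + 1):
        # Reverse the J loop with respect to I
        js = range(1, n + 1) if I % 2 == 1 else range(n, 0, -1)
        for J in js:
            # Reverse the K loop with respect to J
            ks = range(1, n + 1) if J % 2 == 1 else range(n, 0, -1)
            for K in ks:
                trace.append((I, J, K))
                for i1 in range(m):
                    for j1 in range(m):
                        s = 0.0
                        for k1 in range(m):
                            s += (B[(I - 1) * m + i1][(K - 1) * m + k1]
                                  * C[(K - 1) * m + k1][(J - 1) * m + j1])
                        A[(I - 1) * m + i1][(J - 1) * m + j1] += s
    return A, trace
```

With a 4 by 4 matrix and m = 2, the trace begins (1,1,1), (1,1,2), (1,2,2), (1,2,1): when J advances, the K loop restarts at the block it just finished with, which is the effect loop reversal is meant to achieve.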
C.6 CDS


CDS:PROC(A,NSMALL,MSMALL);

/*THIS PROC FACTORS SYM.POS.DEF. MATRIX A(OF ORDER*/
/*NSMALL) INTO G*G' ,G IS LOWER TRIANG.THIS IS A */
/*SUBMATRIX ALGORITHM */

N=NSMALL/MSMALL;
DO I=1 TO N;
  /* COMPUTE D=D-R*R' */
  DO J=1 TO I-1;
    DO I1=1 TO MSMALL;
      DO J1=1 TO I1;
        S=0;
        DO K1=1 TO MSMALL;
          S=S+A((I-1)*MSMALL+I1,(J-1)*MSMALL+K1)*
              A((I-1)*MSMALL+J1,(J-1)*MSMALL+K1);
        END;
        A((I-1)*MSMALL+I1,(I-1)*MSMALL+J1)=
        A((I-1)*MSMALL+I1,(I-1)*MSMALL+J1)-S;
      END;
    END;
  END;

  /* DO DECOMPOSITION OF DIAG.SUBM. D */
  CALL CD(A((I-1)*MSMALL+1:I*MSMALL,
          (I-1)*MSMALL+1:I*MSMALL),MSMALL);
  /*INVERT SLT PART OF D & STORE IN SUT PART OF D*/
  DO K=2 TO MSMALL;
    S1=A((I-1)*MSMALL+K,(I-1)*MSMALL+K);
    DO I1=1 TO K-1;
      S=A((I-1)*MSMALL+K,(I-1)*MSMALL+I1)/
        A((I-1)*MSMALL+I1,(I-1)*MSMALL+I1);
      DO J=I1+1 TO K-1;
        S=S+A((I-1)*MSMALL+K,(I-1)*MSMALL+J)*
            A((I-1)*MSMALL+I1,(I-1)*MSMALL+J);
      END;
      A((I-1)*MSMALL+I1,(I-1)*MSMALL+K)=S/S1;
    END;
  END;

  /*COMPUTE C=C-M*R' */
  DO J=I+1 TO N;
    DO K=1 TO I-1;
      DO J1=1 TO MSMALL;
        DO I1=1 TO MSMALL;
          S=0;
          DO K1=1 TO MSMALL;
            S=S+A((J-1)*MSMALL+J1,(K-1)*MSMALL+K1)*
                A((I-1)*MSMALL+I1,(K-1)*MSMALL+K1);
          END;
          A((J-1)*MSMALL+J1,(I-1)*MSMALL+I1)=
          A((J-1)*MSMALL+J1,(I-1)*MSMALL+I1)-S;
        END;
      END;
    END;
  END;

  /* MULT G<J,I> BY G<I,I> INVERSE */
  DCL S(MSMALL) FLOAT; /*TEMP.VECTOR */
  DO J=I+1 TO N;
    DO J1=1 TO MSMALL;
      DO I1=1 TO MSMALL;
        S(I1)=A((J-1)*MSMALL+J1,(I-1)*MSMALL+I1);
      END;
      DO I1=1 TO MSMALL;
        S=S(I1)/A((I-1)*MSMALL+I1,(I-1)*MSMALL+I1);
        DO K1=1 TO I1-1;
          S=S+S(K1)*A((I-1)*MSMALL+K1,(I-1)*MSMALL+I1);
        END;
        A((J-1)*MSMALL+J1,(I-1)*MSMALL+I1)=S;
END CDS;
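
The three phases of CDS (update the diagonal submatrix D, factor D, update the submatrices below D) can be followed more easily in a simplified Python sketch. This is not a transcription of CDS: it assumes a symmetric positive definite input, solves the blocks below D by forward substitution instead of storing an inverse in the upper triangle of D, and uses the hypothetical name `blocked_cholesky`.

```python
import math


def blocked_cholesky(A, m):
    """Left-looking Cholesky factorization by m-by-m submatrices.
    Returns a lower triangular G with G * G' = A (A symmetric positive definite)."""
    ns = len(A)
    assert ns % m == 0, "order must be a multiple of the block size"
    n = ns // m
    G = [list(map(float, row)) for row in A]
    for I in range(n):
        r0 = I * m  # first row and column of the diagonal block D
        # D = D - R * R' for each block R to the left of D (lower triangle only)
        for J in range(I):
            for i1 in range(m):
                for j1 in range(i1 + 1):
                    s = sum(G[r0 + i1][J * m + k1] * G[r0 + j1][J * m + k1]
                            for k1 in range(m))
                    G[r0 + i1][r0 + j1] -= s
        # Unblocked Cholesky factorization of D
        for j in range(r0, r0 + m):
            G[j][j] = math.sqrt(G[j][j] - sum(G[j][k] ** 2 for k in range(r0, j)))
            for i in range(j + 1, r0 + m):
                s = sum(G[i][k] * G[j][k] for k in range(r0, j))
                G[i][j] = (G[i][j] - s) / G[j][j]
        # Blocks below D: subtract earlier columns, then forward-substitute against D
        for J in range(I + 1, n):
            for j1 in range(m):
                row = J * m + j1
                for i1 in range(m):
                    col = r0 + i1
                    s = sum(G[row][k] * G[col][k] for k in range(col))
                    G[row][col] = (G[row][col] - s) / G[col][col]
    # Clear the strict upper triangle so that G is lower triangular
    for i in range(ns):
        for j in range(i + 1, ns):
            G[i][j] = 0.0
    return G
```

The block size m changes only the order in which entries are computed, so the result should agree (up to rounding) for any m that divides the matrix order.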




APPENDIX  D 
ALGORITHM  AP 

ALG-AP: 

/* Constructs the USE-LIST and uses it to insert instructions in an
OL/2 program */
Step AP-1 [Data Structure]:

[A]: A node in the USE-LIST is one of seven types: ARRAY, SUBARRAY,
FOR-BEGIN, FOR-END, DEAD-SUBARRAY, END-OF-LIFETIME, END-OF-SCOPE.

(1) A node of type ARRAY has three fields: NAME, LP and NEXT.
NAME is a pointer to the ACB of the array, LP is the lifetime
pointer of the array, and NEXT is the static link in the USE-LIST.

(2) A node of type SUBARRAY has the following fields: NAME, X, KC,
NEXT, LP.

NAME is a pointer to the ACB of the subarray. X consists of four
subfields: r_{I-1}, r_I, c_{J-1}, c_J, each indicating a boundary expression
of the subarray. NEXT and LP are as in (1) above. KC is a
(possibly empty) set of subfields, each threaded on a dynamic link
(k-chain) of some FOR loop (and indicating that this subarray is
dynamic with respect to that loop).

(3) A node of type FOR-BEGIN has three fields: NAME, KC and NEXT.

NAME contains some unique identifier of the FOR loop; we have assumed
that the loop control variable is such an identifier. NEXT is as in
(1) above. The field KC originates the k-chain of the loop.

(4) A node of type FOR-END has three fields: BP, NEXT and KC.

BP is a back-pointer to the associated FOR-BEGIN node, NEXT is
as in (1) above, KC is the end of the k-chain of the loop.

(5) A node of type DEAD-SUBARRAY has three fields: NAME, X and KC.
All three fields are as in (2) above.

(6) A node of type END-OF-LIFETIME has two fields: RP and NEXT.
RP is a return-pointer to the ARRAY or SUBARRAY node of this
(sub)array. NEXT is as in (1) above.

(7) A node of type END-OF-SCOPE has two fields: NAME and NEXT.
NAME is the name of the array and NEXT is as before.

[B]: The ACB of each array is modified to have two additional pointer
fields, IP and CP.

[C]: Definitions: (Scope and lifetime)

(1) S(A) = if A is of type ARRAY then
               ( min IP[B] over B ⊆ A, max LP[B] over B ⊆ A );
           else if A is of type FOR-BEGIN then
               (pointer to FOR-BEGIN[A], pointer to FOR-END[A]);

(2) L(A) = if A is of type ARRAY or SUBARRAY then (IP[A], LP[A]);
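
The scope and lifetime definitions in [C] amount to interval computations over positions in the USE-LIST. The following Python sketch makes that reading explicit. It rests on several assumptions that are not in the text: positions are modelled as integers, IP and LP are looked up by (sub)array name in plain dictionaries, the set B ⊆ A is taken to include A itself, and the helper names are hypothetical.

```python
from typing import Dict, List, Tuple

Interval = Tuple[int, int]  # (first, last) positions in the USE-LIST


def lifetime(name: str, ip: Dict[str, int], lp: Dict[str, int]) -> Interval:
    """L(A) = (IP[A], LP[A]) for an array or subarray."""
    return (ip[name], lp[name])


def scope(members: List[str], ip: Dict[str, int], lp: Dict[str, int]) -> Interval:
    """S(A) = (min IP[B], max LP[B]) over the (sub)arrays B of A."""
    return (min(ip[b] for b in members), max(lp[b] for b in members))


def intersects(a: Interval, b: Interval) -> bool:
    """True when two position intervals share at least one position."""
    return a[0] <= b[1] and b[0] <= a[1]
```

The `intersects` helper corresponds to tests of the form S(f) ∩ L(B) ≠ φ that appear in Step AP-3.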

Step AP-2 [Build up the USE-LIST during the syntax analysis phase]:
[A] if SCAN = 'FOR' then
    do;
      Create a node of type FOR-BEGIN, pointed to by P, and link it
      to the USE-LIST (on the static link); set NAME[P] = 'k';
      KC[P] = null;
    end;
[B] if SCAN = 'A' /* an array */ then if CP[A] = null then
    /* the first occurrence of A */
    do;
      Create a node of type ARRAY, pointed to by P, and link it to
      the USE-LIST; set IP[A] = P; CP[A] = P; NAME[P] = A; LP[P] = null;
    end; else /* not the first occurrence */ CP[A] = P;

[C] if SCAN = 'B' then /* a subarray, B = A<I,J> */
    if CP[B] = null then /* first occurrence */
    do;
      Create a node of type SUBARRAY, pointed to by P, and link it
      to the USE-LIST; set IP[B] = CP[B] = P; NAME[P] = B; LP[P] = null;
      X[P] = (r_{I-1}, r_I, c_{J-1}, c_J); KC[P] = null;
    end; else /* not first occurrence */ CP[B] = P;

[D] if SCAN = 'END' then /* end of a FOR loop */
    do;
      Create a node of type FOR-END, pointed to by P, and link it
      to the USE-LIST; link this node to the k-chain of the corresponding
      FOR loop and set BP[P] = pointer to the matching FOR-BEGIN node;
    end;
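
Step AP-2 can be sketched in Python as a single pass over a token stream. This is a rough illustration under stated assumptions, not the author's implementation: tokens are hypothetical tuples such as ("FOR", "k") or ("SUBARRAY", "A<k,1>"), nodes are dictionaries, pointers are list indices, the boundary field X and the k-chain links are left out, and on a repeated occurrence CP is assumed to record the index of the most recently created node.

```python
def build_use_list(tokens):
    """Build a simplified USE-LIST following Step AP-2.
    Returns (use_list, ip, cp); ip and cp map (sub)array names to list indices."""
    use_list = []    # nodes in static-link order
    ip, cp = {}, {}  # first and current positions of each (sub)array
    open_loops = []  # indices of FOR-BEGIN nodes not yet matched by an END
    for tok in tokens:
        kind = tok[0]
        if kind == "FOR":
            use_list.append({"type": "FOR-BEGIN", "name": tok[1], "kc": None})
            open_loops.append(len(use_list) - 1)
        elif kind in ("ARRAY", "SUBARRAY"):
            name = tok[1]
            if name not in cp:
                # First occurrence: create the node and set IP and CP
                node = {"type": kind, "name": name, "lp": None}
                if kind == "SUBARRAY":
                    node["kc"] = None
                use_list.append(node)
                ip[name] = cp[name] = len(use_list) - 1
            else:
                # Later occurrence: only CP moves (assumed: to the current tail)
                cp[name] = len(use_list) - 1
        elif kind == "END":
            begin = open_loops.pop()
            use_list.append({"type": "FOR-END", "bp": begin})
    return use_list, ip, cp
```

After this pass, Step AP-3 [A] would set each node's LP from the final CP value of its (sub)array; that step is not included in the sketch.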

Step AP-3 [This step modifies the USE-LIST to add dead subarrays, lifetime
pointers, correct lifetimes and end-of-scope nodes. This is
done in a post-syntax-analysis, precoding phase]:
[A] [fill up lifetime pointers and create end-of-lifetime nodes]:
    Scan the USE-LIST sequentially.

    if SCAN = ARRAY[P] or SUBARRAY[P] then
    do;
      B = NAME[P]; LP[P] = CP[B]; Create a node of type
      END-OF-LIFETIME, pointed to by Q, and insert in the USE-LIST
      at LP[P]; set RP[Q] = P;
    end;

[B] [link dynamic subarrays on the k-chain]:
    Scan the USE-LIST sequentially.

    if the current nest level l > 0 then if SCAN = SUBARRAY[P] then
    do;
      B = NAME[P];
      for all 'FOR' loops f do;
        if S(f) ∩ L(B) ≠ φ then
        do;
          k = NAME[f];
          if P(B, k) = T then link B to the k-chain of loop f;
        end;
      end; end;

[C] [modify the USE-LIST to take care of the overlapping lifetimes,
    introduce end-of-scope nodes, enter dead subarrays]:
(1) [modify overlapping lifetimes]:

    for all arrays A such that S(A) ≠ φ, do;

      /* let l(B) be the level of a subarray B of A. Here the level
         means the level in the ACB tree structure */
      l_min = min l(B) over B ⊆ A; l_max = max l(B) over B ⊆ A;

      /* note that l(A) = 1, l(A<1,2>) = 2 */

      for j = l_max to l_min, -1, do;

        for all D ⊆ A such that l(D) = j do;

          let Q be the pointer to the node D in the USE-LIST; let
          the lowest level ancestor of D in the USE-LIST be denoted
          by E and P points to its node in the USE-LIST;

          if (IP[E] > IP[D]) and (LP[P] > LP[Q]) then
          do;
            CP[D] = LP[Q] = LP[P]; delete the END-OF-LIFETIME
            node of D;
          end;

          if (LP[P] < LP[Q]) then
          do;
            CP[E] = LP[P] = LP[Q]; delete the END-OF-LIFETIME
            node of E; RP[Q] = P;
            if (IP[E] < IP[D]) then
            do;
              delete node D from the USE-LIST; CP[D] = IP[D] = null;
    end [C] (1);
(2) [insert the END-OF-SCOPE nodes]:

    for all arrays A such that there is a subarray B of A
    such that B ∈ USE-LIST do;

      set (P,Q) = S(A); Create a node of type END-OF-SCOPE
      and insert it in the USE-LIST at Q; NAME[Q] = A;

    end;

(3) [create dead subarrays and link to USE-LIST]:

for  all  arrays  j.   USE-LIST  such  that  there  is  a  subarray  B 
of A such that B ∈ USE-LIST do;

  j = max { l(B) | B ⊆ A, B ∈ USE-LIST, there is a k
            such that P(B,k) = T };
  construct a sequence of subarrays of A at level j;
  name the above sequence DEAD-LIST(A);
  if (B ∈ DEAD-LIST(A)) and ((B ∈ USE-LIST) or
     (for all k P(B,k) = F)) then delete B from DEAD-LIST(A);
  for each FOR loop f do;
    for each B ∈ DEAD-LIST(A) do;
      let the loop control variable of loop f be k;
      let Q point to FOR-BEGIN node of f;
      if S(f) ∩ S(A) ≠ φ then if P(B,k) = T then do;
        create a node of type DEAD-SUBARRAY, pointed to by P,
        and link it to the k-chain of f;
        NAME[P] = B; X[P] = (r_{I-1}, r_I, c_{J-1}, c_J);
      end;
end Step AP-3;
Step AP-4 [use of the USE-LIST in the coding phase to do freeing
and prepaging]:
[A] [freeing]:

    if at lexical level l = 0 then on scanning an END-OF-LIFETIME
    node of B do;

      if B is an array then generate instruction: 'FREE(B)';

      if B is a subarray then generate instructions:
        'IF for all x ∈ X, MOD(x,m)=0 THEN FREE(B)';
    end;
    on scanning an END-OF-SCOPE node of A generate the instruction:

      'FREE(A)';
    if at level l > 0 then on scanning an END-OF-LIFETIME or
    END-OF-SCOPE node of B do the same as above except that the
    code is inserted after the level l is reduced to zero, i.e.
    after exit from the outermost loop;

[B] [prepaging outside the loop]:

    if at level l = 0 then generate instructions to prepage a few
    nodes in advance;
    if at level l > 0 then prior to generating code for the loop,
    generate instructions to prepage all the (sub)arrays in the
    USE-LIST from the FOR-BEGIN node to the FOR-END node along
    the static link;

[C] [freeing within a loop]:
    (level l > 0)

    for all dead subarrays B along the dynamic chain of the loop,
    generate the following code just after the backward branch
    target of the loop:

      'IF ((for all x ∈ X_1, f_{k_{i+1}}(x) < 0) AND
           (for all x ∈ X_2, f_{k_{i+1}}(x) > 0)) THEN
         IF for all x ∈ X, MOD(x(k_i),m) = 0 THEN FREE(B)';
[D] [incremental prepaging within a loop]:
    (level l > 0)

    for all dynamic subarrays B along the dynamic chain of the
    loop, insert the following code after the backward branch
    target of the loop:

      'IF ((there exists x ∈ X_1 such that f_{k_{i+1}}(x) = -1) OR
           (there exists x ∈ X_2 such that f_{k_{i+1}}(x) = 1)) THEN
         IF there exists x ∈ X_0 such that MOD(x(k_i),m) = 0
         THEN PREPAGE(B)';
end ALG-AP;